Data, Text, and Web Mining for Business Analytics

A Qualitative Literature Review on Linkage Techniques for Data Integration

(2020-01-07) Kruse, Felix ; Hassan, Ahmad Pajam ; Awick, Jan-Philipp ; Marx Gómez, Jorge

The data linkage techniques “entity linking” and “record linkage” are receiving increasing attention because they enable the integration of multiple data sources for data, web, and text mining approaches. This has resulted in the development of numerous algorithms and systems for these techniques in recent years. The goal of this publication is to provide an overview of these data linkage techniques. Most papers deal with record linkage and structured data. Processing unstructured data through entity linking is gaining attention with the Big Data trend. Currently, deep learning algorithms are being explored for both linkage techniques. Most publications focus their research on a single process step or on the entire process of “entity linking” or “record linkage”. However, the papers share the limitation that the approaches and techniques used have always been optimized for only a few data sources.
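As a rough illustration of what record linkage on structured data involves, the sketch below matches company names from two sources by normalized string similarity. It is not taken from the reviewed literature: the function names, the sample records, and the 0.85 threshold are all hypothetical, and real systems typically add blocking, multiple attributes, and learned classifiers.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] based on difflib's matching-blocks ratio."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left: dict, right: dict, threshold: float = 0.85) -> list:
    """For each left record, keep its best right match if the score reaches the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    links = []
    for left_id, left_name in left.items():
        best_id, best_score = None, 0.0
        for right_id, right_name in right.items():
            score = similarity(left_name, right_name)
            if score > best_score:
                best_id, best_score = right_id, score
        if best_id is not None and best_score >= threshold:
            links.append((left_id, best_id, best_score))
    return links


# Hypothetical sample data from two sources describing the same companies.
crm = {"c1": "Acme Corporation", "c2": "Globex  GmbH"}
registry = {"r1": "ACME Corporation", "r2": "Initech LLC", "r3": "Globex GmbH"}
links = link_records(crm, registry)
```

Comparing every pair of records is quadratic in the number of records, which is one reason the limitation noted in the abstract (optimization for only a few data sources) matters in practice.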

A Partial Least-Squares Regression Model to Measure Parkinson’s Disease Motor States Using Smartphone Data

(2020-01-07) Memedi, Mevludin ; Aghanavesi, Somayeh

Design choices made during the development of data-driven models can significantly affect, and in some cases degrade, the predictive performance of the models. One of the essential steps during the development and evaluation of such models is the choice of feature selection and dimension reduction techniques. This is especially important when dealing with multimodal data gathered from different sources. In this paper, we investigate the behavior of Partial Least Squares (PLS) regression for dimension reduction and prediction of motor states of Parkinson’s disease (PD) patients, using upper limb motor data gathered by means of a smartphone. The results, in terms of correlations between smartphone-based and clinician-derived scores, were compared to a previous study using the same data, in which principal component analysis (PCA) and support vector machines (SVM) were used. The findings from this study show that PLS outperforms the combination of PCA and SVM in predicting motor states in PD. This indicates that PLS could be considered a useful methodology in problems where data-driven analysis is needed.

Improving Prediction Models for Mass Assessment: A Data Stream Approach

(2020-01-07) Shi, Donghui ; Guan, Jian ; Zurada, Jozef ; Levitan, Alan

Mass appraisal is the process of valuing a large collection of properties within a city or municipality, usually for tax purposes. The common methodology for mass appraisal is based on multiple regression, though this methodology has been found to be deficient. Data mining methods have been proposed and tested as an alternative, but the results are very mixed. This study introduces a new approach to building prediction models for assessing residential property values by treating past sales transactions as a data stream. The study used 110,525 sales transaction records from a municipality in the Midwest of the US. Our results show that a data-stream-based approach outperforms the traditional regression approach, thus showing its potential for improving the performance of prediction models for mass assessment.

Prevent Low-Quality Analytics by Automatic Selection of the Best-Fitting Training Data

(2020-01-07) Kiefer, Cornelia ; Reimann, Peter ; Mitschang, Bernhard

Data analysis pipelines consist of a sequence of various analysis tools. Most of these tools are based on supervised machine learning techniques and thus rely on labeled training data. Selecting appropriate training data has a crucial impact on analytics quality. Yet, most of the time, domain experts who construct analysis pipelines neglect the task of selecting appropriate training data. They rely on default training data sets, for example because they do not know which other training data sets exist and what they are used for. However, default training data sets may be very different from the domain-specific input data that is to be analyzed, leading to low-quality results. Moreover, these input data sets are usually unlabeled. Thus, analytics quality cannot be measured with evaluation metrics. Our contribution comprises a method that (1) indicates the expected quality to the domain expert while constructing the analysis pipeline, without the need for labels, and (2) automatically selects the best-fitting training data. It is based on a measurement of the similarity between input and training data. In our evaluation, we consider the part-of-speech tagger tool and show that Latent Semantic Analysis (LSA) and Cosine Similarity are suited as indicators for the quality of analysis results and as a basis for an automatic selection of the best-fitting training data.
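The idea of ranking candidate training sets by their similarity to unlabeled input data can be sketched as follows. This is a simplified illustration, not the authors' method: it computes cosine similarity on raw term counts only and leaves out the LSA step mentioned in the abstract. The function names and the two tiny example corpora are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Bag-of-words term counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_training_set(input_text: str, candidates: dict) -> tuple:
    """Return the name of the most similar candidate corpus and all similarity scores."""
    input_vec = term_counts(input_text)
    scores = {
        name: cosine_similarity(input_vec, term_counts(text))
        for name, text in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical candidate training corpora and unlabeled input data.
candidates = {
    "newswire": "the minister said the government will vote on the budget",
    "chat": "lol gonna grab pizza u want some lol",
}
best, scores = select_training_set("u gonna want pizza lol", candidates)
```

Because the score needs no labels on the input data, it can be shown to a domain expert while the pipeline is being assembled, which is the use case the abstract describes.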

Use of Conventional Machine Learning to Optimize Deep Learning Hyper-parameters for NLP Labeling Tasks

(2020-01-07) Gu, Yang ; Leroy, Gondy

Deep learning delivers good performance in classification tasks, but is suboptimal with small and unbalanced datasets, which are common in many domains. To address this limitation, we use conventional machine learning, i.e., support vector machines (SVM), to tune deep learning hyper-parameters. We evaluated our approach using mental health electronic health records from which diagnostic criteria needed to be extracted. A bidirectional Long Short-Term Memory network (BI-LSTM) could not learn the labels for the seven scarcest classes, but saw an increase in performance after training with optimal weights learned from tuning SVMs. With these customized class weights, the F1 scores for rare classes rose from 0 to values ranging from 18% to 57%. Overall, the BI-LSTM with SVM-customized class weights achieved a micro-average of 47.1% for F1 across all classes, an improvement over the regular BI-LSTM’s 45.9%. The main contribution lies in avoiding null performance for rare classes.
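The gap between a modest micro-average gain and the jump from 0 for rare classes follows from how the two scores are computed. The sketch below works this through on made-up labels; it assumes a single-label classification setting for simplicity, and the helper names and toy data are hypothetical rather than drawn from the paper.

```python
from collections import Counter


def per_class_counts(y_true: list, y_pred: list) -> tuple:
    """Count true positives, false positives, and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    return tp, fp, fn


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_report(y_true: list, y_pred: list) -> tuple:
    """Return per-class F1 scores and the micro-averaged F1 over pooled counts."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, micro


# Toy data: a classifier that never predicts the rare class.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10
per_class, micro = f1_report(y_true, y_pred)
```

In this toy case the rare class has an F1 of 0 while the micro-average is 0.8, because micro-averaging pools counts across classes and is dominated by the frequent ones. That is why per-class scores are needed to see the effect the abstract highlights.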

Topic Modeling and Transfer Learning for Automated Surveillance of Injury Reports in Consumer Product Reviews

(2020-01-07) Goldberg, David ; Zaman, Nohel

Many modern firms and interest groups are tasked with the challenge of monitoring the status and performance of a bevy of distinct products. As online user-generated content has increased in volume, new unstructured data sources are available for mining unique insights. Reports of injuries arising as a result of product usage are particularly concerning. In this paper, we utilize complementary approaches to address this problem. We analyze two novel datasets: first, a government-maintained dataset of hazard and injury reports, and second, a large dataset of cross-industry consumer product reviews manually coded for the presence of hazard and injury reports. We apply an unsupervised topic modeling approach to characterize the hazard and injury reports detected. Then, we implement a supervised transfer learning technique, using information obtained from the government-maintained dataset to detect hazard and injury reports in online reviews. Our results offer improved surveillance for monitoring hazards across multiple industries.

Introduction to the Minitrack on Data, Text, and Web Mining for Business Analytics

(2020-01-07) Delen, Dursun ; Davazdahemami, Behrooz ; Zolbanin, Hamed
