Collaboration for Data Science
Dissecting Moneyball: Improving Classification Model Interpretability in Baseball Pitch Prediction (2020-01-07)
Data science, where technical expertise meets domain knowledge, is collaborative by nature. Complex machine learning models have achieved human-level performance in many areas, yet they face adoption challenges in practice because their outputs are hard to interpret, particularly for users who lack specialized technical knowledge. One key question is how to unpack complex classification models and enhance their interpretability to facilitate collaboration in data science research and application. In this study, we extend two state-of-the-art methods for drawing fine-grained explanations from the results of classification models. The main extensions are aggregating explanations from individual instances to a user-defined aggregation level, and providing explanations in terms of the original features rather than engineered representations. We use the prediction of baseball pitch outcomes as a case to evaluate our extended methods. Experimental results on real sensor data demonstrate improved interpretability while preserving superior prediction performance.
Metrics for Analyzing Social Documents to Understand Joint Work (2020-01-07)
Social Collaboration Analytics (SCA) aims at measuring and interpreting communication and joint work on collaboration platforms and is a relatively new topic in the discipline of Information Systems. Previous applications of SCA are largely based on transactional data (event logs). In this paper, we propose a novel approach for examining collaboration based on the structure of social documents. Guided by the ontology for social business documents (SocDOnt), we develop metrics to measure collaboration around documents that provide traces of collaborative activity. For the evaluation, we apply these metrics to a large-scale collaboration platform. The findings show that group workspaces that support the same use case are characterized by a similar richness of their social documents (i.e., the number of components and contributing authors). We also show typical differences in the “collaborativity” of functional modules (containers).
Crowdsourcing Data Science: A Qualitative Analysis of Organizations’ Usage of Kaggle Competitions (2020-01-07)
In light of ongoing digitization, companies accumulate data that they want to transform into value. However, data scientists are scarce and organizations are struggling to acquire talent. At the same time, individuals who are interested in machine learning are participating in competitions on data science internet platforms. To investigate whether companies can tackle their data science challenges by hosting data science competitions on internet platforms, we conducted ten interviews with data scientists. While there are various perceived benefits, such as discussing with participants and learning new, state-of-the-art approaches, these competitions can cover only a fraction of the tasks that typically occur during data science projects. We identified 12 factors within three categories that influence an organization’s perceived success when hosting a data science competition.
WeSAL: Applying Active Supervision to Find High-quality Labels at Industrial Scale (2020-01-07)
Obtaining hand-labeled training data is one of the most tedious and expensive parts of the machine learning pipeline. Previous approaches, such as active learning, aim at optimizing user engagement to acquire accurate labels. Other methods utilize weak supervision to generate low-quality labels at scale. In this paper, we propose a new hybrid method named WeSAL that incorporates Weak Supervision sources with Active Learning to keep humans in the loop. The method aims to generate large-scale training labels while enhancing their quality by involving domain experience. To evaluate WeSAL, we compare it with two state-of-the-art labeling techniques, Active Learning and Data Programming. The experiments use five publicly available datasets and a real-world dataset of 1.5M records provided by our industrial partner, IBM. The results indicate that WeSAL can generate large-scale, high-quality labels while reducing the labeling cost by up to 68% compared to active learning.