Text Analytics

What to prioritize? Natural Language Processing for the Development of a Modern Bug Tracking Solution in Hardware Development

(2022-01-04) Do, Thi Thu Hang; Dobler, Markus; Kühl, Niklas

Managing large numbers of incoming bug reports and finding the most critical issues in hardware development is time-consuming but crucial for reducing development costs. In this paper, we present an approach to predict the time to fix, the risk, and the complexity of debugging and resolving a bug report using different supervised machine learning algorithms, namely Random Forest, Naive Bayes, SVM, MLP, and XGBoost. Further, we investigate the effect of applying active learning, and we evaluate the impact of different text representation techniques, namely TF-IDF, Word2Vec, Universal Sentence Encoder, and XLNet, on the model's performance. The evaluation shows that a combination of text embeddings generated by the Universal Sentence Encoder and an MLP classifier outperforms all other methods and is well suited to predicting the risk and complexity of bug tickets.

Towards Automated Moderation: Enabling Toxic Language Detection with Transfer Learning and Attention-Based Models

(2022-01-04) Caron, Matthew; Bäumer, Frederik S.; Müller, Oliver

Our world is more connected than ever before. Sadly, this connectivity has made it easier to bully, insult, and propagate hate speech in cyberspace. Even though researchers and companies alike have started investigating this real-world problem, the question remains as to why users are increasingly being exposed to hate and discrimination online. In fact, the noticeable and persistent increase in harmful language on social media platforms indicates that the situation is only getting worse. Hence, in this work, we show that contemporary ML methods can help tackle this challenge in an accurate and cost-effective manner. Our experiments demonstrate that a universal approach combining transfer learning methods and state-of-the-art Transformer architectures can enable the efficient development of toxic language detection models. With this universal approach, we offer platform providers a simple way to automate the moderation of user-generated content and, as a result, hope to contribute to making the web a safer place.

On the use of Machine Learning and Deep Learning for Text Similarity and Categorization and its Application to Troubleshooting Automation

(2022-01-04) Couto, Julia; Tomaz, Laura; Godoy, Julia; Kniest, Davi; Callegari, Daniel; Meneguzzi, Felipe; Ruiz, Duncan

Troubleshooting is a labor-intensive task that often involves applying the same solutions to similar problems. This task can be partially or fully automated by using text-similarity matching to find previous solutions, lowering the workload of technicians. We conduct a systematic literature review to identify the best approaches to troubleshooting automation and effective incident classification. We identify promising approaches and point toward a comprehensive set of solutions that could be employed to solve the troubleshooting automation problem.
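As a rough illustration of text-similarity matching for finding previous solutions, the sketch below ranks past tickets against a new one using TF-IDF weights and cosine similarity. It is not taken from the paper: the technique choice, the function names (`tfidf_vectors`, `most_similar_ticket`), and the sample tickets are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of token -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed IDF, so tokens found in every document keep a positive weight.
    idf = {
        token: math.log((1 + n_docs) / (1 + df)) + 1.0
        for token, df in doc_freq.items()
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({token: n * idf[token] for token, n in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_ticket(query, past_tickets):
    """Return (index, score) of the past ticket most similar to the query."""
    vectors = tfidf_vectors(past_tickets + [query])
    query_vec = vectors[-1]
    scores = [cosine(query_vec, vec) for vec in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical tickets, invented for this example.
past_tickets = [
    "Printer does not respond after driver update",
    "VPN connection drops every few minutes",
    "Email client crashes when opening attachments",
]
query = "VPN keeps disconnecting after a few minutes"

best_index, best_score = most_similar_ticket(query, past_tickets)
print(past_tickets[best_index], round(best_score, 3))
```

In a real troubleshooting system, the index of the best match would be used to look up the solution recorded for that earlier ticket. Lexical matching like this misses paraphrases, which is one reason learned embeddings may be preferred.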

New Threats to Privacy-preserving Text Representations

(2022-01-04) Zhan, Huixin; Zhang, Kun; Hu, Chenyi; Sheng, Victor

Users’ privacy concerns require data publishers to anonymize data before sharing it with data consumers. Thus, the ultimate goal of privacy-preserving representation learning is to protect user privacy while ensuring the utility of the published data, e.g., its accuracy, for future tasks and uses. Privacy-preserving embeddings are usually functions that encode an input text into low-dimensional vectors to protect privacy while preserving important semantic information. We demonstrate that these embeddings still leak private information, even though the low-dimensional embeddings encode generic semantics. We develop two classes of attacks, an adversarial classification attack and an adversarial generation attack, to study the threats to these embeddings. In particular, the threats are that (1) these embeddings may reveal sensitive attributes regardless of whether those attributes explicitly appear in the input text, and (2) the embedding vectors can be partially recovered via generation models. In addition, our experimental results show that our approach can produce higher-performing adversary models than other adversary baselines.

Generating Vocabulary Sets for Implicit Language Learning using Masked Language Modeling

(2022-01-04) Edgar, Vatricia; Bansal, Ajay

A well-balanced language curriculum must include both explicit and implicit vocabulary learning. However, most language learning applications focus on explicit instruction. Students require support with implicit vocabulary learning because they need enough context to guess and acquire new words. Traditional techniques aim to teach students enough vocabulary to comprehend the text, thus enabling them to acquire new words. Despite the wide variety of support for vocabulary learning offered by learning applications today, few offer guidance on how to select an optimal vocabulary study set. This paper proposes a novel method of student modeling with masked language modeling to detect words that are required for comprehension of a text. It explores the efficacy of using deep learning via a pre-trained masked language model to model human reading comprehension and presents a vocabulary study set generation pipeline (VSGP). Promising results show that masked language modeling can be used to model human comprehension and that the pipeline produces reasonably sized vocabulary study sets that can be integrated into language learning systems.

Introduction to the Minitrack on Text Analytics

(2022-01-04) Cogburn, Derrick; Hine, Michael; Peladeau, Normand; Yoon, Victoria
