Text Mining and Analytics

Recent Submissions

  • Item
    Synthetic APIs: Enabling Language Models to Act as Interlocutors Between Natural Language and Code
    (2023-01-03) Mullins, Ryan; Terry, Michael
    Large language models (LLMs) can synthesize code from natural language descriptions or by completing code in-context. In this paper, we consider the ability of LLMs to synthesize code, at inference time, for a novel API not present in their training data, and specifically examine the impact of different API designs on this ability. We find that: 1) code examples in model training data seem to facilitate API use at inference time; 2) hallucination is the most common failure mode; and 3) the designs of both the novel API and the prompt affect performance. In light of these findings, we introduce the concept of a Synthetic API: an API designed to be used by LLMs instead of by humans. Synthetic APIs for LLMs offer the potential to further accelerate the development of natural language interfaces to arbitrary tools and services.
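
    As a loose illustration of the setup this abstract describes (an LLM prompted with documentation for an API absent from its training data, with hallucination as the dominant failure mode), here is a hypothetical sketch that builds such a prompt and flags hallucinated API members. The roomctl API, prompt wording, and helper functions are invented for illustration and are not the paper's artifacts.

```python
# Hypothetical sketch: prompt an LLM with a novel ("synthetic") API spec and
# check its output for hallucinated symbols. Everything here is illustrative.
import re

# Documentation for a fictional API the model has never seen in training.
NOVEL_API_SPEC = """\
Module: roomctl
  roomctl.set_lights(room: str, level: int) -> None   # level is 0-100
  roomctl.set_temp(room: str, celsius: float) -> None
"""

def build_prompt(request: str) -> str:
    """Combine the API documentation with a natural-language request."""
    return (
        "You may only use the API below.\n\n"
        f"{NOVEL_API_SPEC}\n"
        f"Task: {request}\n"
        "Write the Python call(s) that accomplish the task:\n"
    )

def hallucinated_symbols(generated_code: str) -> set[str]:
    """Flag roomctl.* attributes that do not exist in the spec (a common failure mode)."""
    allowed = {"set_lights", "set_temp"}
    used = set(re.findall(r"roomctl\.(\w+)", generated_code))
    return used - allowed

if __name__ == "__main__":
    prompt = build_prompt("Dim the kitchen lights to 20% and set it to 21 degrees.")
    print(prompt)
    # Imagine `completion` came back from an LLM given `prompt`:
    completion = "roomctl.set_lights('kitchen', 20)\nroomctl.set_heating('kitchen', 21.0)"
    print("Hallucinated API members:", hallucinated_symbols(completion))
```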
  • Item
    Introduction to the Minitrack on Text Mining and Analytics
    (2023-01-03) Cogburn, Derrick; Hine, Michael; Yoon, Victoria
  • Item
    Hey Article, What Are You About? Question Answering for Information Systems Articles through Transformer Models for Long Sequences
    (2023-01-03) Ebert, Louisa; Huettemann, Sebastian; Mueller, Roland M.
    Question Answering (QA) systems can significantly reduce the manual effort of searching for relevant information. However, challenges arise from a lack of domain specificity and the fact that QA systems usually retrieve answers from short text passages instead of long scientific articles. We aim to address these challenges by (1) exploring the use of transformer models for long sequence processing, (2) performing domain adaptation for the Information Systems (IS) discipline, and (3) developing novel techniques by performing domain adaptation in multiple training phases. Our models were pre-trained on a corpus of 2 million sentences retrieved from 3,463 articles from the Senior Scholars' Basket and fine-tuned on SQuAD and a manually created set of 500 QA pairs from the IS field. In six experiments, we tested two transfer learning techniques for fine-tuning (TANDA and FANDO). The results show that fine-tuning with task-specific domain knowledge considerably increases the models' F1 and Exact Match scores.
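
    For readers unfamiliar with long-sequence QA, the following minimal sketch shows extractive question answering over a full-length article with a Longformer-style model via Hugging Face transformers. The checkpoint name and input file are placeholders; the paper's IS-adapted models and multi-phase TANDA/FANDO fine-tuning are not reproduced here.

```python
# Minimal sketch: extractive QA over a long document with a long-sequence model.
from transformers import pipeline

# A publicly available long-sequence QA checkpoint (assumed; not the paper's model).
qa = pipeline(
    "question-answering",
    model="allenai/longformer-large-4096-finetuned-triviaqa",
)

article_text = open("is_article.txt").read()   # full-length article, not a short passage
question = "What research method does the article use?"

# The pipeline chunks the long context with a stride and aggregates candidate answer spans.
answer = qa(question=question, context=article_text, max_seq_len=4096, doc_stride=256)
print(answer["answer"], answer["score"])
```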
  • Item
    Boosting Factual Consistency and High Coverage in Unsupervised Abstractive Summarization
    (2023-01-03) Huang, Yen-Hao; Kuo, Chin-Ting; Chen, Yi-Shin
    Abstractive summarization has gained attention because of the strong performance of large-scale, pretrained language models. However, models may generate a summary that contains information different from the original document. This phenomenon, known as factual inconsistency, is particularly critical for abstractive methods. This study proposes an unsupervised abstractive method for improving factual consistency and coverage by adopting reinforcement learning. The proposed framework includes (1) a novel design to maintain factual consistency with an automatic question-answering process between the generated summary and original document, and (2) a novel method of ranking keywords based on word dependency, where keywords are used to examine the coverage of the key information preserved in the summary. The experimental results show that the proposed method outperforms the reinforcement learning baseline on both the factual consistency and coverage evaluations.
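
    To make the QA-based consistency idea concrete, here is a generic sketch of a factual-consistency score: answer the same questions against the source document and against the generated summary, then compare the answers. The QA checkpoint, question source, and reward shaping are simplified assumptions and do not reproduce the paper's reinforcement learning framework.

```python
# Generic sketch of a QA-based factual-consistency reward (not the paper's exact design).
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between two short answer strings."""
    p, r = pred.lower().split(), ref.lower().split()
    common = sum(min(p.count(t), r.count(t)) for t in set(p))
    if not common:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

def consistency_reward(document: str, summary: str, questions: list[str]) -> float:
    """Average agreement between answers grounded in the document vs. the summary."""
    scores = []
    for q in questions:
        ans_doc = qa(question=q, context=document)["answer"]
        ans_sum = qa(question=q, context=summary)["answer"]
        scores.append(token_f1(ans_sum, ans_doc))
    return sum(scores) / len(scores) if scores else 0.0
```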
  • Item
    Assessing Text Representation Methods on Tag Prediction Task for StackOverflow
    (2023-01-03) Skenderi, Erjon; Laaksonen, Salla-Maaria; Stefanidis, Kostas; Huhtamäki, Jukka
    A large part of knowledge evolves outside the operations of an organization. Question and answer online social platforms provide an important source of information for exploring the underlying communities. StackOverflow (SO) is one of the most popular question and answer platforms for developers, with more than 23 million questions asked. Organizing and categorizing data is crucial to managing knowledge in such large quantities. Questions posted on SO are assigned a set of tags, and the textual content of each question may contain code syntax. In this paper, we evaluate the performance of multiple text representation methods on the task of predicting tags for SO questions and empirically demonstrate the impact of code syntax on text representations. The SO dataset was sampled and questions without code syntax were identified. Two classical text representation methods, BoW and TF-IDF, were selected alongside four methods based on pre-trained models: fastText, USE, Sentence-BERT, and Sentence-RoBERTa. A multi-label k-Nearest Neighbors classifier was used to learn and predict tags based on the similarities between feature-vector representations of the input data. Our results indicate a consistent superiority of the representations generated by Sentence-RoBERTa. Overall, the classifier achieved a 17% or higher improvement in F1 score when predicting tags for questions without any code syntax in their content.
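
    As a minimal sketch of one configuration from this comparison, the snippet below pairs TF-IDF features with a multi-label k-Nearest Neighbors classifier on toy StackOverflow-style questions. The paper's sampled SO dataset, sentence-embedding methods, and evaluation protocol are not reproduced; scikit-learn's native multi-label KNN stands in for the multi-label k-NN classifier the abstract mentions.

```python
# Minimal sketch: TF-IDF features + multi-label k-NN tag prediction on toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MultiLabelBinarizer

questions = [
    "How do I merge two dicts in Python?",
    "What does the static keyword mean in Java?",
    "How can I center a div with CSS flexbox?",
    "How do I read a CSV file into a pandas DataFrame?",
]
tags = [["python"], ["java"], ["css", "html"], ["python", "pandas"]]

# Turn question text into TF-IDF vectors and tags into a binary indicator matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(questions)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# scikit-learn's KNN supports multi-label targets natively; cosine distance is a
# common choice for sparse TF-IDF vectors.
knn = KNeighborsClassifier(n_neighbors=2, metric="cosine")
knn.fit(X, Y)

pred = knn.predict(vectorizer.transform(["How to filter rows of a pandas DataFrame?"]))
print(mlb.inverse_transform(pred))
```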