Data Analytics, Data Mining, and Machine Learning for Social Media

Recent Submissions

  • Item
    From PogChamps to Insights: Detecting Original Content in Twitch Chat
    (2025-01-07) Lindroos, Jari; Peltonen, Jaakko; Välisalo, Tanja; Koskimaa, Raine; Toivanen, Ida
    The vast volume of chat messages generated during esports events on Twitch represents a valuable source of data for understanding audience behavior. However, the sheer quantity and dynamic nature of this data make manual analysis impractical. This study addresses this challenge by introducing FinTwitchBERT, a model fine-tuned to classify Twitch chat messages into four categories based on their uniqueness. Our model demonstrates the ability to distinguish between original content, repetitive messages such as emote spamming, formulaic messages, and interactive commands chat participants use to interact with channel bots. Pre-trained on over 18 million Finnish Twitch chat messages and utilizing a combination of semi-supervised learning and iterative pseudo-labeling with human-in-the-loop validation, FinTwitchBERT achieves 97.42% accuracy on a test set of unseen chat messages with a limited initial dataset of only 7,529 manually annotated messages.
  • Item
    A Comparative Analysis of Reddit Discussions on Meat Reduction in Portugal, Poland, and the United Kingdom
    (2025-01-07) Roszczynska-Kurasinska, Magdalena; Biesaga, Mikolaj; Alves De Oliveira, Carolina
    In recent years, diets that minimize or exclude meat and animal-derived products have garnered considerable attention and popularity in the Western world. This study seeks to analyze Reddit discussion data to investigate the motivations, opportunities, and capabilities involved in adopting plant-based, vegetarian, or reduced meat consumption diets in three European countries: Poland, Portugal, and the United Kingdom. Considering all 1,151 comments from the three Reddit threads, motivations were the most frequently discussed COM-B component, with 291 comments (25%), followed by opportunities with 195 comments (17%). Capabilities were the least mentioned component, appearing in 90 comments (8%). The study highlights and contextualizes the similarities and differences in barriers and enablers across the countries, considering their unique cultural, economic, and social environments.
  • Item
    Fine-grained Feature Fusion Framework for Online Crowdfunding Success Prediction
    (2025-01-07) Zhao, Yiming; Deng, Chaoqun; Gao, Qiang
    Crowdfunding provides a new solution for individuals and companies to overcome financial hardships. However, how to improve the success rate of crowdfunding remains a challenge for project initiators. Pre-launch crowdfunding success prediction allows the initiator to understand the likelihood of crowdfunding success and then adjust the project information based on the result. Previous deep learning-based pre-launch crowdfunding success prediction models mainly focused on improving model performance by applying cutting-edge AI models or algorithms. These methods ignored the fine-grained features contained in projects and platforms that cannot be recognized by pre-trained encoders. In this study, we use speech act theory to recognize linguistic patterns in project descriptions. We also apply contestable market theory to capture the fine-grained regional features as well as competition intensity on the platform and use social exchange theory to add key information that may affect donors’ decisions in impact letters to the framework. The experiment results demonstrate the effectiveness of using the fine-grained features in pre-launch crowdfunding success prediction.
  • Item
    Unpacking Algorithmic Bias in YouTube Shorts by Analyzing Thumbnails
    (2025-01-07) Cakmak, Mert Can; Agarwal, Nitin
    As digital platforms increasingly shape our online experiences, the influence of recommendation algorithms on user behavior becomes ever more significant. This research delves into the biases inherent in YouTube Shorts' recommendation algorithms by analyzing the topical content of thumbnails through captions generated by advanced generative AI models, specifically GPT and Llama. Employing topic modeling and clustering techniques, we scrutinized a substantial dataset of YouTube Shorts to uncover patterns of bias within the recommendation process. Our findings reveal a significant drift in recommended content from serious geopolitical topics to broader, entertainment-focused themes, underscoring the impact of algorithmic preferences on user engagement. This study highlights the necessity for greater transparency and fairness in content recommendation systems, offering valuable insights into the ethical implications of algorithmic bias in digital media.
  • Item
    How Are Employees Using Collaboration Software to Support Their Work? A Method for Analyzing Digital Traces in Enterprise Collaboration Systems
    (2025-01-07) Schubert, Petra; Williams, Susan; Just, Martin; Alberts, Jens; Bahles, Sebastian
    In this study, we present a novel method for investigating the digital traces of collaborative user activity in large-scale Enterprise Collaboration Systems (ECS). Guided by existing research, we developed a classification metric (Collaborative Work Codes) to describe the type of work that can be identified in the event logs of collaboration software. Following a Design Science Research approach, we developed a computational technique that assigns the codes to event records based on a mapping table. In two evaluation cycles, the computational technique was applied to two ECS datasets (the first provided by a research group, the second by a large German manufacturing company). The combined data was imported into a dashboard and used to evaluate the coding method and the suitability of the codes for analysis. The findings show that the codes appropriately reflect the type of work carried out by the users.
  • Item
    Beyond Privacy: Understanding and Mitigating Doxing in the Digital Environment
    (2025-01-07) Bottlapally, Tharun; Zhou, Lina
    The proliferation of social media has transformed digital communication, offering unprecedented connectivity while also exposing individuals to privacy risks like doxing, the malicious disclosure of private information. This study investigates doxing through the lenses of digital ethics and cyber safety, seeking to develop effective mitigation strategies. Among the various countermeasures to online doxing, including technical solutions, legal reforms, and educational initiatives, this study focuses on technical solutions by leveraging transformer-based models such as BERT and RoBERTa to detect doxing activities in textual data. Our empirical results demonstrate the effectiveness of the transformer-based models in detecting online doxing behavior. Additionally, this study discusses the socio-ethical implications of doxing, taking into account the different stakeholders involved. Using a multidisciplinary approach, this study offers strategies to strengthen digital security and ethics, fostering a safer online environment.
  • Item
    A Comparative Evaluation of the SIR and SEIZ Epidemiological Models to Describe the Diffusion Characteristics of COVID-19 Polarizing Viewpoints on Online Social Networks
    (2025-01-07) Maleki, Maryam; Agarwal, Nitin
    To understand and characterize the diffusion trends of opposing viewpoints on Twitter, we applied two epidemiological models to six datasets related to COVID-19. We compared the results of the SIR (Susceptible, Infected, Recovered) and the SEIZ (Susceptible, Exposed, Infected, Skeptics) epidemiological models. We collected six datasets indicative of polarizing viewpoints related to contentious subjects surrounding the COVID-19 pandemic. Three of the datasets fall into an anti-subject hashtag group, and three fall into a pro-subject hashtag group. The timeframe of each dataset is from January 1, 2020, to the end of 2021. Our findings demonstrate that while both the SIR and the SEIZ models can evaluate the propagation trends of the polarizing viewpoints, the SEIZ model is more accurate, with relatively less error than the SIR model. This work sets the stage for developing methods that prevent the propagation of ideas lacking scientific evidence while promoting the spread of scientifically backed ideas.
  • Item
    The Interplay of Heuristic and Systematic Processing of Information in News Validation: A Natural Language Processing Approach
    (2025-01-07) Boyce, James; Namvar, Morteza; Liu, Yiying; Lyu, Beibei; Mensforth, Martha; Tao, Hsi-Wen (Arvin)
    With the growth of social media, user-generated content has become one of the key sources for news, but verifying its validity remains a challenge for both platforms and users. This study leverages the Heuristic-Systematic Model and, in a novel approach, employs state-of-the-art Natural Language Processing techniques to develop text features explaining the validity of online news from the user’s perspective. We used a panel dataset of over 130,000 labelled tweets to statistically validate our research model. Our results show that content similarity and loaded language (heuristic features) directly affect the perceived validity of online news. We also demonstrated the direct effect of language intensity and conflicting sentiment (systematic features) and their moderating effect of loaded language on the perceived validity of online news. Our proposed method sheds light on how platforms should analyze text features to evaluate online news credibility, which in turn provides a more reliable online environment.
  • Item
    Unmasking Public Sentiment: A Sample Efficient Approach to Analyzing Twitter Opinion on US Aid to Ukraine
    (2025-01-07) Ghosh, Satanu; Ghosh, Souvick; Dewitt, Nikhil; Mccoy, Denise
    The U.S. aid to Ukraine is a bipartisan topic of extreme socio-political importance. While several organizations have conducted surveys to understand the public stance on this topic, there is no research to date that analyses public opinion on social media, possibly due to the lack of annotated data. Therefore, this research compares several sample-efficient methods (including in-context learning) to analyze tweet sentiments with minimal training data. First, we collect 11,289 tweets about the U.S. aid to Ukraine and map them to U.S. states. Next, we explore three different approaches to sentiment analysis: tool-based, embedding-based, and prompt-based. Our results indicate that GPT-4 Few Shot improves accuracy by 121.8% and 77.5% over TextBlob and Vader, respectively. Our geospatial analysis shows that Indiana has the most negative normalized net sentiment (NNS) of -0.83, while Vermont has the most positive NNS of +0.33. Finally, we perform a detailed thematic analysis to identify the common arguments that support or oppose the aid. We highlight that our results do not correlate with media surveys, possibly due to the presence of echo chambers and algorithmic biases.
  • Item
    Introduction to the Minitrack on Data Analytics, Data Mining, and Machine Learning for Social Media
    (2025-01-07) Yates, David; Mentzer, Kevin; Gerhart, Natalie