Data Analytics, Data Mining and Machine Learning for Social Media


Recent Submissions

  • Item
    Detection of Sentiment Provoking Events in Social Media
    (2019-01-08) Daou, Hoda
    Social media has become one of the main sources of news and events due to its ability to disseminate and propagate information very quickly and to a large population. Social media platforms are widely accessible to the population, making it difficult to extract relevant information from the huge amount of posted data. In our study, we propose an algorithm that automatically detects events using strong sentiment classification and appropriate clustering techniques. We focus our study on a specific type of event that triggers strong sentiment among the public. Results show that the suggested methodology is able to identify important events, such as a mass shooting and plane crash, using a generalized and simple approach.
  • Item
    Utilizing Social Media For Lead Generation
    (2019-01-08) Prakash, Adjitesh ; Caton, Simon ; Haas, Christian
    Social media is the most prevalent platform for communication, and for forming and maintaining professional as well as social relationships. The growth of platforms and the exponential rise in the user base of social media websites like LinkedIn, Facebook and Twitter are evidence of their widespread acceptance. They present many opportunities for businesses to exploit this facet of digitally mediated relationships, for example by spreading awareness about the business and engaging with prospective customers. The focus of this research is on the use of social media to identify relevant profiles or "leads" for a business in sourcing new employees or collaborators. The paper utilizes data from the social networking sites Twitter and LinkedIn and presents an automated approach for the discovery of leads. For the considered business cases, Twitter was found to be irrelevant for lead generation due to its emphasis on personal vs. professional user positioning. The presented final approach utilizes only four attributes from LinkedIn users' profiles to generate high quality leads, and is tested for robustness to variations in input data, different business contexts and vulnerability to noise in the input data. The results show the robustness and consistency of the presented approach to generate leads despite utilizing a small subset of features.
  • Item
    DEvIR: Data Collection and Analysis for the Recommendation of Events and Itineraries
    (2019-01-08) Nurbakova, Diana ; Laporte, Léa ; Calabretto, Sylvie ; Gensel, Jerome
    Distributed events such as multi-day festivals and conventions attract thousands of attendees. Their programs are usually very dense, which makes it difficult for users to select activities to perform. Recent works have proposed event and itinerary recommendation algorithms to solve this problem. Although several datasets have been made available for the evaluation of event recommendation algorithms, they are not well suited to the case of distributed events or itinerary recommendation. Based on the study of available online resources, we define dataset attributes required to perform event and itinerary recommendations in the context of distributed events, and discuss the compliance of existing datasets with these requirements. Revealing the lack of publicly available datasets with desired features, we describe a data collection process to acquire the publicly available data from a major comic book convention website. We present the characteristics of the collected data and discuss its usability for evaluating recommendation algorithms.
  • Item
    Gradients of Fear and Anger in the Social Media Response to Terrorism
    (2019-01-08) Baucum, Matthew ; John, Richard
    Research suggests that public fear and anger in the wake of a terror attack can each uniquely contribute to policy attitudes and risk-avoidance behaviors. Given the importance of these negative-valenced emotions, there is value in studying how terror events can incite fear and anger at various times and locations relative to an attack. We analyzed 36,259 Twitter posts authored in response to the 2016 Orlando nightclub shooting and examined how fear- and anger-related language varied with time and distance from the attack. Fear-related words sharply decreased over time, though the trend was strongest at locations near the attack, while anger-related words slightly decreased over time and increased with distance from Orlando. Comparing these results to users' pre-attack emotional language suggested that distant users remained both angry and fearful after the shooting, while users close to the attack remained angry but quickly reduced expressions of fear to pre-attack levels.
  • Item
    Incorporating Context and Location Into Social Media Analysis: A Scalable, Cloud-Based Approach for More Powerful Data Science
    (2019-01-08) Anderson, Jennings ; Casas Saez, Gerard ; Anderson, Kenneth ; Palen, Leysia ; Morss, Rebecca
    Dominated by quantitative data science techniques, social media data analysis often fails to incorporate the surrounding context, conversation, and metadata that allow for more complete, accurate, and informed analysis. Here we describe the development of a scalable data collection infrastructure to interrogate massive amounts of tweets (including complete user conversations) to perform contextualized social media analysis. Additionally, we discuss the nuances of location metadata and incorporate it when available to situate the user conversations within geographic context through an interactive map. The map also spatially clusters tweets to identify important locations and movement between them, illuminating specific behavior, like evacuating before a hurricane. We share performance details and the promising results of concurrent research utilizing this infrastructure, and discuss the challenges and ethics of using context-rich datasets.
  • Item
    Unsupervised Content-Based Characterization and Anomaly Detection of Online Community Dynamics
    (2019-01-08) Shah, Danelle ; Hurley, Michael ; Liu, Jessamyn ; Daggett, Matthew
    The structure and behavior of human networks have been investigated and quantitatively modeled by modern social scientists for decades; however, the scope of these efforts is often constrained by the labor-intensive curation processes that are required to collect, organize, and analyze network data. The surge in online social media in recent years provides a new source of dynamic, semi-structured data of digital human networks, many of which embody attributes of real-world networks. In this paper we leverage the Reddit social media platform to study social communities whose dynamics indicate they may have experienced a disturbance event. We describe an unsupervised approach to analyzing natural language content for quantifying community similarity, monitoring temporal changes, and detecting anomalies indicative of disturbance events. We demonstrate how this method is able to detect anomalies in a spectrum of Reddit communities and discuss its applicability to unsupervised event detection for a broader class of social media use cases.
  • Item
    Toward Automatic Fake News Classification
    (2019-01-08) Ghosh, Souvick ; Shah, Chirag
    The interaction of technology with humans has many adverse effects. The rapid growth and outreach of social media and the Web have led to the dissemination of questionable and untrusted content among a wider audience, which has negatively influenced their lives and judgment. Different election campaigns around the world highlighted how "fake news" (misinformation that looks genuine) can be targeted towards specific communities to manipulate and confuse them. Ever since, automatic fake news detection has gained widespread attention from the scientific community. As a result, many research studies have been conducted to tackle the detection and spreading of fake news. While the first step of such tasks would be to classify claims based on their credibility, the next steps would involve identifying hidden patterns in the style, syntax, and content of such news claims. We provide a comprehensive overview of what has already been done in this domain and other similar fields, and then propose a generalized method based on Deep Neural Networks to identify if a given claim is fake or genuine. By using different features like the authenticity of the source, perceived cognitive authority, style and content-based factors, and natural language features, it is possible to predict fake news accurately. We have used a modular approach by combining techniques from information retrieval, natural language processing, and deep learning. Our classifier comprises two main sub-modules. The first sub-module uses the claim to retrieve relevant articles from the knowledge base which can then be used to verify the truth of the claim. It also uses word-level features for prediction. The second sub-module uses a deep neural network to learn the underlying style of fake content.
    Our experiments conducted on benchmark datasets show that for the given classification task we can obtain up to 82.4% accuracy by using a combination of two models; the first model was up to 72% accurate while the second model was around 81% accurate. Our detection model has the potential to automatically detect and prevent the spread of fake news, thus limiting the caustic influence of technology on human lives.
  • Item
    Which Way to the Wheat Field? Women of the Radical Right on Facebook
    (2019-01-08) Squire, Megan
    At what rates and in what capacity do women participate in extreme far-right ("radical right") political online communities? Gathering precise demographic details about members of extremist groups in the United States is difficult because of a lack of data. The purpose of this research is to collect and analyze data to help explain radical right participation by gender on social media. We used the public Facebook Graph API to create a large dataset of 700,204 members of 1,870 Facebook groups spanning 10 different far-right ideologies during the time period June 2017 - March 2018, then applied two different gender resolution software packages to infer the gender of all users by name. Results show that users inferred to be women join groups in some ideologies at a greater rate than others, but ideology alone does not determine leadership opportunities for women in these groups. Furthermore, our analysis finds similarities between historical women's organizations such as the 1920s Women's Ku Klux Klan and contemporary online "wheat field" groups designed specifically for women.
  • Item
    Collective Classification for Social Media Credibility Estimation
    (2019-01-08) O'Brien, Kyle ; Simek, Olga ; Waugh, Frederick
    We introduce a novel extension of the iterative classification algorithm to heterogeneous graphs and apply it to estimate credibility in social media. Given a heterogeneous graph of events, users, and websites derived from social media posts, and given prior knowledge of the credibility of a subset of graph nodes, the approach iteratively converges to a set of classifiers that estimate credibility of the remaining nodes. To measure the performance of this approach, we train on a set of manually labeled events extracted from a corpus of Twitter data and calculate the resulting receiver operating characteristic (ROC) curves. We show that collective classification outperforms independent classification approaches, implying that graph dependencies are crucial to estimating credibility in social media.
  • Item
    A Latent Dirichlet Allocation Approach using Mixed Graph of Terms for Sentiment Analysis
    (2019-01-08) Casillo, Mario ; Clarizia, Fabio ; Colace, Francesco ; De Santo, Massimo ; Lombardi, Marco ; Pascale, Francesco
    The spread of generic (such as Twitter, Facebook or Google+) or specialized (such as LinkedIn or Viadeo) social networks allows millions of users to share opinions on different aspects of life every day. Therefore, this information is a rich source of data for opinion mining and sentiment analysis. This paper presents a novel approach to sentiment analysis based on Latent Dirichlet Allocation (LDA). The proposed methodology aims to identify a word-based graphical model (we call it a mixed graph of terms) for depicting a positive or negative attitude towards a topic. Using this model, it is possible to automatically mine positive and negative sentiments from documents. Experimental evaluation, on standard and real datasets, shows that the proposed approach is effective and furnishes good and reliable results.