Using machine learning to analyze unproven narratives on social media
April 25, 2022
The COVID-19 pandemic drove a parallel “infodemic”: the rapid spread of competing and often harmful narratives about the virus.
Social media continues to play a central role in this infodemic, serving as a forum for the spread and evolution of theories and beliefs with origins in broadcast, print, online news, blogs, and other digital arenas.
The ability of decision-makers to understand the role of social media in spreading ideas and beliefs is limited by the scale of activity on these platforms, where users produce millions of posts per day.
To navigate this tsunami of information, analysts supporting time-sensitive missions have often had to rely on coarse, limited analytic approaches. Assessments that required accurate, granular data analytics, on the other hand, called for labor-intensive approaches that took too long.
As a result, decision-makers were left without the information they needed in fast-moving situations.
As the COVID-19 infodemic grew, our teams used Amazon Web Services (AWS) to help produce a capability that overcame the tradeoff between accuracy and speed in social media analysis. The result, dubbed Rapid Narrative Analysis (RNA), achieves accuracy by using human expertise at critical stages of analysis, and it uses machine learning (ML) models to diagnose the severity of the spread of key narratives at the speed needed to take effective action.
Using RNA, we compared the severity of three prominent, unproven assertions within the Twitter discussion of COVID-19:
To do this, RNA compared the size of the online communities (i.e., the number of users) expressing belief or disbelief in these assertions and then measured the rate of growth of these communities over time.
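The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the RNA implementation: the record layout (user, day, stance), the function names, and the sample data are all assumptions made for the example, and growth is taken here as simple day-over-day relative change in the number of distinct users.

```python
from collections import defaultdict
from datetime import date


def community_sizes(posts):
    """Count distinct users per stance per day.

    posts: iterable of (user_id, day, stance) tuples (hypothetical layout).
    Returns {stance: {day: number_of_distinct_users}}.
    """
    users = defaultdict(lambda: defaultdict(set))
    for user_id, day, stance in posts:
        users[stance][day].add(user_id)
    return {
        stance: {day: len(ids) for day, ids in sorted(days.items())}
        for stance, days in users.items()
    }


def growth_rates(sizes_by_day):
    """Relative change in community size between consecutive observed days."""
    days = sorted(sizes_by_day)
    return {
        cur: (sizes_by_day[cur] - sizes_by_day[prev]) / sizes_by_day[prev]
        for prev, cur in zip(days, days[1:])
    }


# Made-up sample data, for illustration only.
posts = [
    ("u1", date(2020, 3, 1), "belief"),
    ("u2", date(2020, 3, 1), "belief"),
    ("u4", date(2020, 3, 1), "disbelief"),
    ("u1", date(2020, 3, 2), "belief"),
    ("u2", date(2020, 3, 2), "belief"),
    ("u3", date(2020, 3, 2), "belief"),
    ("u4", date(2020, 3, 2), "disbelief"),
]

sizes = community_sizes(posts)
for stance, by_day in sizes.items():
    print(stance, by_day, growth_rates(by_day))
```

Counting distinct users rather than posts keeps a handful of prolific accounts from inflating the apparent size of a community.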
Our RNA approach begins with creating a small amount of high-quality, human-processed data to train a collection of machine learning models. We determined that as few as 500 human-coded tweets, approximately eight hours of human effort, were enough training data for our models to maintain high accuracy while analyzing hundreds of thousands of tweets. By training on top of state-of-the-art language models, we achieved high accuracy despite the small training set: an F1 score of 85 percent, where F1 is the harmonic mean of a model's precision and recall.
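For readers unfamiliar with the metric, the sketch below computes an F1 score for a single label from human-coded and model-predicted stances. The labels and data are invented for illustration, and the source does not state how the reported score was averaged across classes, so this shows only the single-class definition.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: precision and recall are both 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels, for illustration only.
human_coded = ["belief", "belief", "disbelief", "belief"]
predicted = ["belief", "disbelief", "disbelief", "belief"]

score = f1_score(human_coded, predicted, positive="belief")
print(f"F1 for 'belief': {score:.2f}")
```

In this toy case precision is 1.0 and recall is 2/3, so the F1 score is about 0.80.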
To human-code the data, RNA used our Mission Analytics user interface to ingest and label a subset of tweets that were representative of the topics of interest. This training data was then used to train a model, which was applied to more than 700,000 tweets. The job ran from an Amazon Machine Image (AMI) on an Amazon Elastic Compute Cloud (Amazon EC2) GPU Spot Instance configured with AdaptNLP, our open-source natural language processing (NLP) framework.
To achieve these results at scale, our Machine Learning Center of Excellence (ML-COE) used numerous AWS cloud computing services, including AWS Deep Learning AMIs, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon Elastic Container Registry (Amazon ECR).
RNA moves beyond “virality,” a core concept underlying dominant approaches to social media analysis defined as “the tendency of an image, video, or piece of information to be circulated rapidly and widely from one Internet user to another.”
The concept of virality has often led social media analytics to focus narrowly on discrete markers such as hashtags, links to outside domains, or keywords. The measured prominence of a set of markers is used to infer the prominence of the perspectives or beliefs that roughly correlate with those markers. In essence, the more often a marker is seen, the more support that perspective or belief is inferred to have.
However, this approach often fails to capture the complex and dynamic ways these markers are used, co-opted, and satirized in the public space: users share links to stories and perspectives they agree with as well as those they disagree with, and trolls and bots use trending hashtags to hijack conversations and introduce entirely unrelated ideas.
RNA uses an analytic framework more akin to that of “virulence” in medicine, which is concerned with “the severity or harmfulness of a disease or poison.” In social media analysis, this framework puts the emphasis on measuring how widely ideas are believed online by assessing the full language of users’ posts, not just how often specific markers appear.
In a matter of hours, RNA collected and diagnosed more than 700,000 tweets to assess the virulence of each of the three target narratives. The analysis revealed that the discussion of the assertion that COVID-19 is a biological weapon had a significantly higher share of believers (59%) than the discussions of the other assertions. This result is striking in comparison to the assertion that 5G is responsible for COVID-19, which was the only discussion with considerably more disbelieving participants (45%) than believers (35%).
RNA also mapped hidden features in the development of the COVID-19 Twitter discussion over time. Such features are generally not revealed by traditional hashtag analysis. For the COVID-19-as-a-biological-weapon discussion, the number of new users entering the conversation and expressing belief in the assertion continued to grow even after the number of total participants in the conversation began to decline.
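A pattern like this one can be surfaced by tracking two daily series side by side: the number of distinct participants in the conversation, and the number of users expressing a given stance for the first time. The sketch below is one hypothetical way to compute both; the record layout, the integer day index, the function name, and the sample data are assumptions for illustration, not the RNA implementation.

```python
from collections import defaultdict


def participation_trends(posts, stance="belief"):
    """Per day: total distinct participants and first-time holders of `stance`.

    posts: iterable of (user_id, day, label) tuples (hypothetical layout).
    """
    active = defaultdict(set)   # day -> all users who posted that day
    holders = defaultdict(set)  # day -> users who expressed `stance` that day
    for user_id, day, label in posts:
        active[day].add(user_id)
        if label == stance:
            holders[day].add(user_id)

    seen = set()  # users who have already expressed `stance` on an earlier day
    trends = {}
    for day in sorted(active):
        newcomers = holders[day] - seen
        trends[day] = {
            "participants": len(active[day]),
            "new_holders": len(newcomers),
        }
        seen |= holders[day]
    return trends


# Made-up sample data, for illustration only.
posts = [
    ("a", 1, "belief"), ("b", 1, "disbelief"),
    ("c", 1, "disbelief"), ("d", 1, "disbelief"),
    ("a", 2, "belief"), ("e", 2, "belief"), ("f", 2, "belief"),
]

trends = participation_trends(posts)
for day, counts in trends.items():
    print(day, counts)
```

In the sample data, total participation falls from four users to three between the two days while the number of first-time believers rises from one to two, which is the kind of divergence a count of hashtag volume alone would not show.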
Our approach is designed to integrate into an organization’s information lifecycle. It enables more informed decisions about how to address harmful misinformation based on its virulence, and it makes it possible to evaluate the effectiveness of those decisions in near real time.
A version of this case study was originally published on the AWS Public Sector Blog by Novetta. Accenture Federal Services acquired Novetta in 2021.