As COVID-19 has spread throughout the world, it has driven more and more of our lives online. We’re relying on online shopping to buy goods and services, digital collaboration tools for work and learning – and increasingly, on social media platforms for information about the pandemic. A recent study saw a 25% increase in the volume of posts on Twitter since the spread of COVID-19 began. Unfortunately, some of that increasing volume includes misinformation about COVID-19 – enough that Twitter outlined the criteria it uses to assess and remove misleading pandemic information on its platform.
Misinformation online isn’t an isolated phenomenon related to COVID-19. A well-known 2018 MIT study found that misinformation spreads rapidly online no matter the topic, reaching more people than the truth and spreading faster. Unintentional sharing of incorrect information, sensationalism, rumor, and urban legends proliferate; we see spamming and trolling attempts; we even see complex and deliberate attempts to misinform, like deepfakes.
Working with Indraprastha Institute of Information Technology Delhi, we’ve developed a robust AI-driven approach to this problem. Our solution: a semi-supervised end-to-end attention neural network that detects Twitter posts with misinformation about COVID-19. In early pilots on a dataset of more than 21 million COVID-19-related tweets, it identifies posts with misinformation with 95% accuracy, significantly outperforming comparable algorithms.
This kind of semi-automated detection is key in light of the growing misinformation challenge. When it comes to COVID-19, the WHO director general stated recently that “We’re not just fighting an epidemic; we’re fighting an infodemic.” The spread, speed and complexity of misinformation in social media are overwhelming the human capacity to manually fact-check and regulate it, and companies are increasingly deploying artificial intelligence to assist human fact-checkers. Still, most current AI models need humans to manually label or categorize large amounts of data before the systems can work. Even then, they struggle to identify misinformation that differs from what was found in the training data.
With the evolving avalanche of misinformation about COVID-19, these are significant challenges. The types of related misinformation range from incorrect health advice (for example, “eating garlic cures the virus”) to false information about its origin and spread (for example, “5G networks are related to the spread of the virus”) and false information about its severity (“Coronavirus is just like a normal cold, or a mild flu”). It’s hard to say which of these causes the most harm, but all are potentially dangerous to people’s health and safety.
Unlike other attempts at AI detection of misinformation, our solution considers multiple pieces of context to determine if information is genuine. It doesn’t just look at the content of a tweet, but also information about the user who posted it, for example – and finds the right balance with which to weigh those inputs.
It’s semi-supervised in that it can leverage both labeled and unlabeled data; it learns the semantics and meaning from unlabeled data.
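To make the semi-supervised idea concrete, here is a minimal pseudo-labelling sketch in plain Python. This is an illustration of the general technique, not the authors’ actual neural network: the toy word-scoring “model,” the example tweets, and the confidence thresholds are all invented for the example.

```python
# Pseudo-labelling sketch: train on labelled tweets, then fold
# high-confidence predictions on unlabelled tweets back into the
# training set and retrain. (Illustrative only.)

def train(examples):
    """Toy 'model': for each word, the fraction of its labelled
    occurrences that came from misinformation tweets."""
    word_labels = {}
    for text, label in examples:
        for word in text.lower().split():
            word_labels.setdefault(word, []).append(label)
    return {w: sum(ls) / len(ls) for w, ls in word_labels.items()}

def predict(model, text):
    """Average the per-word scores; 0.5 (uncertain) for unseen words."""
    scores = [model.get(w, 0.5) for w in text.lower().split()]
    return sum(scores) / len(scores)

labeled = [
    ("garlic cures the virus", 1),          # 1 = misinformation
    ("5g networks spread the virus", 1),
    ("wash your hands and keep distance", 0),
    ("vaccines are in clinical trials", 0),
]
unlabeled = ["eating garlic cures covid", "keep distance in public"]

model = train(labeled)
# Keep only confident predictions (thresholds are arbitrary here).
for text in unlabeled:
    p = predict(model, text)
    if p > 0.7 or p < 0.3:
        labeled.append((text, int(p > 0.5)))
model = train(labeled)  # retrain on the expanded set
```

The key property this sketch shares with the real system is that the unlabeled pool expands what the model can learn from without any additional human labelling effort.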
Why end-to-end? Because it also keeps up with changing information and emerging misinformation trends by leveraging external knowledge (from both reliable and unreliable sources). And finally, it’s explainable – it can tell you why it thinks a particular post contains misinformation.
Our approach uses linguistic analysis of the message content itself, such as the terms in the post, incongruity and sarcasm, the sentiment expressed, and so on. But it can also look at the background of the user, the social network context, number of reposts, etc., to gauge the tweet’s virality – false and sensational viral posts tend to spread faster and wider. It also incorporates automated checking of the topic and claims against fact-checking sites such as Snopes in real time. Being able to identify individual claims within a larger piece of content helps catch misinformation embedded in otherwise innocuous material – something that other approaches often miss. None of these signals may be effective when applied individually, but they can be quite powerful when applied together.
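The way attention balances these signals can be sketched as a softmax-weighted combination of per-signal scores. This is a simplified illustration, not the model’s actual architecture: the signal names, scores, and logits below are all hypothetical, and in the real system the weights are learned rather than hand-set.

```python
import math

def softmax(xs):
    """Turn raw logits into weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def misinformation_score(signal_scores, attention_logits):
    """Attention-style weighted combination of per-signal scores."""
    weights = softmax(attention_logits)
    return sum(w * s for w, s in zip(weights, signal_scores))

# Hypothetical per-signal scores in [0, 1] for one tweet:
signals = {
    "content":     0.9,  # linguistic cues in the text itself
    "user":        0.4,  # poster's profile and history
    "propagation": 0.8,  # virality: reposts, spread pattern
    "fact_check":  1.0,  # claim contradicted by a fact-checking site
}
# Illustrative attention logits (learned in a real model):
logits = [1.2, 0.1, 0.8, 2.0]

score = misinformation_score(list(signals.values()), logits)
```

Because the weights come from a softmax, inspecting them also gives a simple form of explainability: the highest-weighted signal is the one driving the decision for that tweet.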
Just how powerful? To test our approach, we began by developing a dataset of publicly available tweets. The dataset is a mix of both labeled and unlabeled posts – more than 45,000 labeled tweets, about 60% of which contained misinformation, and more than 21 million additional unlabeled COVID-related tweets. We compared the accuracy of the model on this dataset with seven state-of-the-art models for detection of misinformation, and it outperformed them all by at least 9%. We’re in the process of doing additional testing on other published datasets.
This is an early effort, but with promising results, especially since our AI can quickly respond to emerging events that generate more misinformation. It could be easily incorporated into workstreams to assist human moderators, identifying possible misinformation with supporting information. This would not only make it easier for moderators to find and remove misinformation, but also allow them to respond to those who inadvertently shared the content with links to reliable sources. Over time, this could reduce the amount of misinformation that’s inadvertently shared in the first place.
We are working to make the system scalable and capable of tackling a wide range of topics, pandemic related or otherwise. Stay tuned to learn more about our efforts!
The authors would like to acknowledge Professor Tanmoy Chakraborty of Indraprastha Institute of Information Technology Delhi for his collaboration on this research. For more information about our work in this space, contact Shubhashis Sengupta.