Baseline: DALL-E 2 / Imagen, VALHALLA, Evaluate, & STEGO – July 2022
July 6, 2022
Welcome to the July 2022 edition of Baseline, Accenture Federal Services’ machine learning newsletter. In Baseline, we share insights on important advances in machine learning technologies likely to impact our federal customers. This month we cover DALL-E 2 and Imagen, VALHALLA, Evaluate, and STEGO.
Methods for generating images from natural language descriptions have advanced rapidly since 2021, when OpenAI released DALL-E, a 12-billion-parameter model that pushed the boundaries of AI’s ability to represent and combine concepts. This year, two new models have achieved state-of-the-art performance on text-to-image synthesis.
OpenAI’s DALL-E 2 improves on its predecessor by generating images that are more realistic and better aligned with the input text, at 4x the resolution. The improvement comes largely from its use of a diffusion model, a newer approach that is more efficient than the transformer-based approach used in the original DALL-E.
Nearly simultaneously, Google Research released Imagen, its own text-to-image synthesis model. Like DALL-E 2, Imagen uses a diffusion model for image generation. However, its authors found that encoding the input text with a large, pre-trained transformer language model improved image quality and text alignment by giving the model a more advanced grasp of human language and concepts.
These latest methods produce astonishing results and create new possibilities for artistic collaboration between humans and AI, but they also raise important ethical questions. The creators of Imagen acknowledge that their model has produced images that reflect societal and cultural biases. OpenAI has drawn on its experience with DALL-E to limit DALL-E 2’s ability to generate harmful images, and it also recognizes that more exploration is required to truly understand the societal impacts of this powerful technology.
Machine translation has traditionally relied on text data alone. Multimodal machine translation (MMT) improves on this by grounding the text in corresponding images during both training and testing. However, because MMT requires images at test time, its performance gains are out of reach when no accompanying images exist, as when translating written works or transcripts of audio recordings.
VALHALLA, short for Visual Hallucination, is a simple but effective translation approach that circumvents this restriction by eliminating the need for images during inference. Inspired by models like DALL-E that create images from natural language descriptions, VALHALLA trains a transformer model to map text to visual representations of real images during training, and then generates “hallucinated,” or synthetic, images for use at test time. These images act as a bridge between languages, enabling VALHALLA to outperform current state-of-the-art multimodal translation systems that rely on token-to-image lookup tables or generative adversarial networks, with even larger performance gains for low-resource languages.
Training a model often involves evaluating dozens of combinations of algorithms and model configurations, which makes it difficult to find the best solution for a specific use case. Hugging Face aims to ease that search with the release of Evaluate, a central library that simplifies and standardizes the evaluation and comparison of model performance. The initial release includes built-in implementations of common metrics for tasks in areas such as natural language processing and computer vision. The library works with many common data and ML libraries, including NumPy, pandas, PyTorch, TensorFlow, and JAX.
This type of library is useful for ML practitioners who need to track model performance across different datasets, or who want to evaluate a group of models of varying size or architecture on a fixed dataset. Other evaluation libraries exist, but Evaluate shares the simplicity of most Hugging Face libraries, which should benefit ML users who need model evaluation tools.
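To make the comparison workflow concrete, here is a minimal sketch in plain Python. It does not use the Evaluate library or its API; the metric functions, the `compare_models` helper, the model names, and the toy predictions are all hypothetical and exist only to illustrate scoring several models on one fixed dataset with shared metric implementations.

```python
from typing import Callable, Dict, List

# A metric takes (predictions, references) and returns a score.
Metric = Callable[[List[int], List[int]], float]


def accuracy(predictions: List[int], references: List[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def f1_binary(predictions: List[int], references: List[int]) -> float:
    """F1 score for binary labels, treating 1 as the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r != 1)
    fn = sum(1 for p, r in pairs if p != 1 and r == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def compare_models(
    model_predictions: Dict[str, List[int]],
    references: List[int],
    metrics: Dict[str, Metric],
) -> Dict[str, Dict[str, float]]:
    """Score every model with every metric on the same reference labels."""
    results = {}
    for model_name, predictions in model_predictions.items():
        if len(predictions) != len(references):
            raise ValueError(f"{model_name}: prediction count does not match references")
        results[model_name] = {
            metric_name: metric(predictions, references)
            for metric_name, metric in metrics.items()
        }
    return results


# Toy data (hypothetical): one fixed dataset, two models of different size.
references = [1, 0, 1, 1, 0, 0, 1, 0]
model_predictions = {
    "small_model": [1, 0, 0, 1, 0, 1, 1, 0],
    "large_model": [1, 0, 1, 1, 0, 0, 1, 1],
}
metrics = {"accuracy": accuracy, "f1": f1_binary}

results = compare_models(model_predictions, references, metrics)
for model_name, scores in results.items():
    print(model_name, scores)
```

Keeping the metric implementations in one place and applying them uniformly is what makes the scores comparable across models; a shared evaluation library serves the same purpose at larger scale.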
Semantic segmentation is the process of assigning a label to every pixel in an image, which makes it possible to distinguish the various objects within the image. Training supervised semantic segmentation models can be difficult because accurate training labels are hard to obtain; producing them is generally an intensive manual process. To address this problem, research scientists from MIT, Microsoft, and Cornell University have introduced STEGO (Self-supervised Transformer with Energy-based Graph Optimization).
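As a small illustration of per-pixel labeling, the sketch below represents a segmentation as a 2D grid of class IDs and scores a predicted grid against a hand-labeled one using per-class intersection-over-union (IoU). The class names, the toy 4x4 grids, and the choice of IoU as the score are assumptions made for illustration, not details taken from the research described here.

```python
from typing import Dict, List

LabelMap = List[List[int]]

# Hypothetical class IDs for a tiny street scene.
CLASS_NAMES = {0: "sky", 1: "road", 2: "car"}


def per_class_iou(predicted: LabelMap, target: LabelMap) -> Dict[int, float]:
    """Intersection-over-union for each class present in either label map."""
    ious = {}
    for class_id in CLASS_NAMES:
        intersection = 0
        union = 0
        for pred_row, target_row in zip(predicted, target):
            for p, t in zip(pred_row, target_row):
                in_pred = p == class_id
                in_target = t == class_id
                if in_pred and in_target:
                    intersection += 1
                if in_pred or in_target:
                    union += 1
        if union > 0:  # skip classes absent from both maps
            ious[class_id] = intersection / union
    return ious


# Every pixel carries exactly one label; this grid stands in for a manual annotation.
target = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 2, 2, 1],
    [1, 1, 1, 1],
]
# A model's output: one "car" pixel and one "road" pixel are mislabeled.
predicted = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 2, 1, 1],
    [1, 1, 1, 0],
]

ious = per_class_iou(predicted, target)
mean_iou = sum(ious.values()) / len(ious)
for class_id, score in ious.items():
    print(f"{CLASS_NAMES[class_id]}: IoU = {score:.3f}")
print(f"mean IoU = {mean_iou:.3f}")
```

Even in this 16-pixel example, every pixel of the reference grid had to be labeled by hand, which hints at why annotation becomes the bottleneck for full-size images.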
STEGO is an algorithm capable of jointly discovering and segmenting objects down to the pixel level without human supervision. The researchers found that existing unsupervised feature learning models, such as DINO, already capture features important to classification, and that these features are consistent across similar images within a corpus. Using a novel loss function, STEGO creates discrete labels from these features to perform semantic segmentation. It can identify and segment relevant objects across multiple visual domains, such as aerial and urban scene images, and the researchers demonstrated state-of-the-art performance on the CocoStuff and Cityscapes segmentation challenges. By reducing the amount of manual labeling needed, the algorithm makes semantic segmentation models more feasible to adopt.
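The step of turning continuous features into discrete labels can be sketched as nearest-centroid assignment under cosine similarity. This is a deliberately simplified stand-in, not STEGO’s loss function or training procedure: the two-dimensional feature vectors, the cluster centers, and the helper names are hypothetical, and in practice the per-pixel features would come from a pretrained backbone such as DINO.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_labels(pixel_features: List[List[Vector]], centroids: List[Vector]) -> List[List[int]]:
    """Give each pixel the index of the centroid most similar to its feature vector."""
    label_map = []
    for row in pixel_features:
        label_row = []
        for feature in row:
            similarities = [cosine_similarity(feature, c) for c in centroids]
            label_row.append(similarities.index(max(similarities)))
        label_map.append(label_row)
    return label_map


# Hypothetical 2x2 image: each pixel has a 2-dimensional feature vector.
pixel_features = [
    [[0.9, 0.1], [0.8, 0.3]],
    [[0.2, 0.7], [0.1, 0.9]],
]
# Hypothetical cluster centers; no human-provided labels are involved.
centroids = [[1.0, 0.0], [0.0, 1.0]]

label_map = assign_labels(pixel_features, centroids)
print(label_map)
```

The resulting cluster indices are anonymous: without supervision, nothing says which index corresponds to which named class, so the clusters have to be matched to named classes afterwards if names are needed.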
Accenture Federal Services is a leader in artificial intelligence for the U.S. federal government. Our Machine Learning Center of Excellence, Discovery Lab, and Advanced Research Group continually assess, develop, and adapt the world’s most innovative techniques and emerging technologies for mission-critical applications.