Baseline: DreamBooth, BEiT-3, and more – October 2022
October 03, 2022
Welcome to the October 2022 edition of Baseline, Accenture Federal Services’ machine learning newsletter. In Baseline, we share insights on important advances in machine learning technologies likely to impact our federal customers. This month we cover DreamBooth, BEiT-3, Stable Diffusion, and OptFormer.
Click here to subscribe to receive Baseline every month in your inbox and stay up-to-date on the latest advancements in machine learning.
Text-to-image models have garnered a lot of attention within the machine learning sphere recently. Text-guided prompts allow for endless creativity when generating images. Generally, the generation algorithms render their own interpretation of a prompt, limiting the user’s ability to steer the output toward a specific image. To achieve a specific look, users perform “prompt engineering,” trying many variations of a prompt until it produces the desired outcome.
To address this use case, Google and Boston University collaborated on a novel approach called DreamBooth that allows users to guide the generation process by feeding the model a small number of input images. Their work improves upon current leading text-to-image models such as DALL-E 2 and GLIDE by enabling subject synthesis from as few as 3-5 input images. Using this approach, the model can take a target object and generate images showing it in diverse contexts with high fidelity. While the approach was designed to be model-agnostic, the researchers demonstrated it using a pre-trained version of Google’s Imagen model.
DreamBooth makes great strides in subject-driven generation and provides an innovative technique for fine-tuning text-to-image diffusion models. This few-shot tuning will allow users to guide image generation and produce images more relevant to their given task.
Multimodal machine learning is a research field that aims to build a unified model which can handle information from multiple modalities, such as text, audio, and visual data. Multimodal research has unique challenges because of the diversity of the data across different tasks. These challenges motivated Microsoft researchers to create BEiT-3 (BERT Pre-training of Image Transformers), a general-purpose multimodal foundation model.
Using a shared Multiway Transformer, BEiT-3 performs masked “language” modeling on monomodal (images and texts separately) and multimodal (image-text pairs) data. Results show that BEiT-3 sets a new performance standard on a wide range of vision and vision-language benchmarks. From object detection to visual question answering, BEiT-3 outperforms previous state-of-the-art foundation models.
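The core of this objective is token masking: hide a fraction of the input tokens and train the model to recover them. The toy sketch below illustrates only the masking step on a text sequence; the function name, mask rate, and `[MASK]` placeholder are illustrative assumptions, not BEiT-3’s actual tokenizer or Multiway Transformer.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with a [MASK] placeholder.

    Returns the corrupted sequence plus a dict mapping masked positions
    to their original tokens -- the targets the model learns to predict.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok  # model is trained to recover these tokens
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = ["a", "dog", "catching", "a", "frisbee"]
corrupted, targets = mask_tokens(tokens, mask_prob=0.4, seed=1)
```

In BEiT-3 the same recipe is applied to image patches (via a visual tokenizer) and to image-text pairs, which is what lets one model and one objective cover all three data types.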
Overall, this new method offers a simple and effective way to build multimodal foundation models at scale and will allow users to harness data across different modalities simultaneously to improve results.
There has been a breakthrough in the quality of text-to-image generation models with the announcement of models such as DALL-E 2 and Imagen, which we covered in a previous edition of Baseline. Since then, the release of another text-to-image generation model called Stable Diffusion has been making waves, not only due to its performance but also due to the decision of its creators to release the full model weights and source code.
Stable Diffusion is the first open-source, free-to-use, large-scale text-to-image generation model. While DALL-E 2 is publicly available for running inference, its code has not been officially released, and it requires users to pay for image generation credits. The open release of Stable Diffusion has led to an explosion of innovation around the model, with tools such as Diffuse The Rest, which lets users supply both text and a rough sketch as inputs to the diffusion model.
Despite the innovation that Stable Diffusion has fueled, it also brings concerns about misuse because it lacks the restrictions of previous models such as DALL-E 2. For example, previous closed-source models excluded real people as acceptable prompt inputs to avoid generating potentially offensive content.
Overall, Stable Diffusion has lowered the barrier for ML practitioners to utilize and customize text-to-image generation models for their own use cases, which will accelerate the community’s creation of creative applications.
When training machine learning models, a common step toward achieving high performance is hyperparameter optimization. Hyperparameters are values that control the learning process and are set before the training of a machine learning model begins. Examples include the learning rate, how splits are made in tree-based methods, and how much weight to give to certain factors. Commonly used hyperparameter optimization algorithms include grid search, random search, and regularized evolution.
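As an illustration of the first two strategies, the sketch below compares grid search (exhaustively evaluating a fixed grid) with random search (sampling configurations from the space). The `objective` function is a synthetic stand-in for training and validating a real model, and the hyperparameter names are illustrative assumptions.

```python
import itertools
import random

def objective(lr, depth):
    """Synthetic validation score (higher is better); a real workflow
    would train a model with these hyperparameters and evaluate it."""
    return -(lr - 0.1) ** 2 - 0.01 * (depth - 5) ** 2

# Grid search: evaluate every combination in a fixed grid.
grid = {"lr": [0.01, 0.05, 0.1, 0.5], "depth": [3, 5, 7]}
best_grid = max(itertools.product(grid["lr"], grid["depth"]),
                key=lambda p: objective(*p))

# Random search: sample configurations uniformly from the search space.
rng = random.Random(0)
samples = [(rng.uniform(0.01, 0.5), rng.randint(3, 7)) for _ in range(12)]
best_random = max(samples, key=lambda p: objective(*p))
```

Grid search guarantees coverage of the grid but scales poorly as hyperparameters are added; random search often finds comparable configurations with far fewer evaluations, which is one reason each algorithm has historically been run with its own setup, as the next paragraph describes.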
Previously, each of these algorithms had to be run separately because each operates over its own search space. Google AI has developed OptFormer, the first Transformer-based hyperparameter optimization framework, and shown that a single Transformer network can imitate seven different optimization algorithms. This model, and the progress toward universal hyperparameter optimization it represents, paves the way for increased automated machine learning capabilities, lowering the manual tuning and expertise needed to train high-performing models.
That’s Baseline for October 2022! Click here to subscribe to receive the November Baseline in your inbox.
Accenture Federal Services is a leader in artificial intelligence for the U.S. federal government. Our Machine Learning Center of Excellence, Discovery Lab, and Advanced Research Group continually assess, develop, and adapt the world’s most innovative techniques and emerging technologies for mission-critical applications.