U.Chicago Digital Media Workshop and Poetry and Poetics Workshop | 4-5:30pm CT, May 17, 2021
Robert Twomey, Ph.D. | roberttwomey.com
With recent advancements in machine learning techniques, researchers have demonstrated remarkable achievements in image synthesis (BigGAN, StyleGAN), textual understanding (GPT-3), and other areas of text and image manipulation. This hands-on workshop introduces state-of-the-art techniques for text-to-image translation, where textual prompts are used to guide the generation of visual imagery. Participants will gain experience with OpenAI's CLIP network and Google's BigGAN, using free Google Colab notebooks which they can apply to their own work after the event. We will discuss broader relationships between text and image in art and literature; consider the strengths and limitations of these new techniques; and relate these computational processes to human language, perception, visual expression, and imagination. Please bring a text you would like to experiment with!
Time | Activity |
---|---|
4:00 | Introductions; Open Google Colab; Introduction to Neural Nets, Generative Adversarial Networks (GANs), Generative Text (Transformers). |
4:10 | Hands-on with Colab notebook: CLIP + BigGAN + CMA-ES; Talk about the format of textual "prompts"/inputs; Explore visual outputs. |
4:40 | Check in on results. Participants informally share work with group; Q&A about challenges/techniques. Participants continue working. |
5:00 | Hands-on with Colab: Interpolation and latent walks. |
5:10 | Discussion, Future Directions |
5:30 | End |
Click on the links below to open the corresponding notebooks in Google Colab. You can run only one notebook at a time.
- BigGAN - BigGAN_handson.ipynb
- Text to Image Generation with BigGAN and CLIP - text_to_image_BiGGAN_CLIP.ipynb
- Generate latent interpolations - generate_from_stored.ipynb
- Batch process textual prompts - text_to_image_batch.ipynb (not yet implemented on Colab)
- How do words specify/suggest/evoke images?
- What do you see when you read? Are some texts more or less imagistic?
- How can we use this artificial machine imagination to understand our human visual imagination?
- How might you incorporate these techniques into your creative production or scholarship?
- What would it mean to diversify machine imagination?
- Google DeepMind BigGAN, Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018
- see the BigGAN hands-on notebook above to get a sense for image generation with BigGAN, noise vectors, truncation, and latent interpolation.
- NVIDIA StyleGAN, A Style-Based Generator Architecture for Generative Adversarial Networks, 2019
- see for example https://thispersondoesnotexist.com/, a photorealistic face generator with StyleGAN2
- OpenAI GPT-3: Language Models are Few-Shot Learners, 2020
- see Kenric Allado-McDowell's Pharmako-AI for an example of a book written with GPT-3.
- OpenAI CLIP: Connecting Text and Image, 2021
- OpenAI DALL-E: Creating Images from Text, 2021
- the interactive examples on this page will give you a sense of the kind of technique we will explore during the workshop.
- Good list of CLIP-related text-to-image notebooks on Reddit r/MachineLearning
Neural Network
Neural Networks, or Artificial Neural Networks (ANNs), are networks (graphs) composed of nodes and edges, loosely modelled on the architecture of the biological brain. They are generally composed of distinct layers of neurons, where the outputs of one layer feed the inputs of the next. Broadly, each node resembles a neuron: it accepts inputs from a number of other nodes and is defined by its own activation function, bias, and forward connections. There are many variations on this basic architecture. Above we see a very simple fully connected, feed-forward network that takes as input 28 x 28 pixel grayscale images (784 input signals) and produces a ten-way digit classification (0-9) at the output. Neural networks are used for many generative and predictive tasks across sound, image, text, and other domains.
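As a concrete (if toy) sketch of the network just described, the PyTorch snippet below builds a small fully connected, feed-forward classifier for 28 x 28 grayscale digit images. The hidden-layer size is an arbitrary choice and the network is untrained, so this illustrates the architecture rather than a working digit recognizer.

```python
import torch
import torch.nn as nn

# A fully connected, feed-forward network: 784 inputs -> hidden layer -> 10 digit classes.
model = nn.Sequential(
    nn.Flatten(),            # 28 x 28 grayscale image -> 784 input signals
    nn.Linear(784, 128),     # hidden layer: weighted inputs plus a per-node bias
    nn.ReLU(),               # activation function at each node
    nn.Linear(128, 10),      # output layer: one score per digit, 0-9
)

x = torch.rand(1, 1, 28, 28)        # a stand-in "image" of random pixels
logits = model(x)                   # forward pass through the layers
print(logits.argmax(dim=1))         # index of the most likely digit class
```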
Generative Adversarial Networks (GANs)
A Generative Adversarial Network (GAN) is a kind of generative model. The basic idea, borrowed from game theory, is to set up a game between two players: the Generator creates samples intended to resemble the training data, and the Discriminator, a binary classifier, evaluates samples to decide whether they are real or fake. We can think of the generator as a counterfeiter trying to make fake money, and the discriminator as the police, trying to accept legitimate money and catch counterfeits. To succeed in this adversarial game, the counterfeiter must learn to make money that is indistinguishable from genuine money; likewise, the generator network must learn to create samples drawn from the same distribution as the training data. Both networks are trained simultaneously.
Ian Goodfellow introduced the architecture in Generative Adversarial Nets, Goodfellow et al. (2014): https://arxiv.org/pdf/1406.2661.pdf
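To make the two-player game concrete, here is a minimal PyTorch sketch of a single GAN training step. The tiny fully connected generator and discriminator, the learning rates, and the random stand-in batch of "real" data are illustrative placeholders, not the architecture from the paper.

```python
import torch
import torch.nn as nn

latent_dim = 64
# Generator: maps a random latent vector to a 784-dimensional "image" in [-1, 1].
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
# Discriminator: binary classifier, outputs the probability that a sample is real.
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):                     # real_images: (batch, 784), scaled to [-1, 1]
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator ("police"): label real samples real, generated samples fake.
    fake_images = G(torch.randn(batch, latent_dim)).detach()
    loss_D = bce(D(real_images), real_labels) + bce(D(fake_images), fake_labels)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator ("counterfeiter"): try to get its samples labelled as real.
    loss_G = bce(D(G(torch.randn(batch, latent_dim))), real_labels)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

print(train_step(torch.rand(16, 784) * 2 - 1))   # one step on a random stand-in batch
```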
BigGAN
BigGAN (2018) set a standard for high-resolution, high-fidelity image synthesis. It used up to four times as many parameters and eight times the batch size of prior models, and synthesized state-of-the-art 512 x 512 pixel images across the 1000 classes of ImageNet. It was also prohibitively expensive to train! Thankfully, DeepMind has released a number of pretrained models for us to explore. Read the paper here: https://arxiv.org/abs/1809.11096.
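As a rough sketch of what sampling from a pretrained BigGAN looks like in code, the snippet below assumes the third-party `pytorch-pretrained-biggan` package (a PyTorch wrapper around the released generators); the class name, truncation value, and output size are illustrative. The BigGAN hands-on notebook above covers the same ideas interactively.

```python
import torch
from pytorch_pretrained_biggan import (BigGAN, one_hot_from_names,
                                       truncated_noise_sample, save_as_images)

model = BigGAN.from_pretrained('biggan-deep-512')   # pretrained 512 x 512 generator
truncation = 0.4                                    # lower truncation = less varied, more "typical" samples

# Condition on one of the 1000 ImageNet classes (name lookup requires nltk).
class_vector = torch.from_numpy(one_hot_from_names(['golden retriever'], batch_size=1))
noise_vector = torch.from_numpy(truncated_noise_sample(truncation=truncation, batch_size=1))

with torch.no_grad():
    output = model(noise_vector, class_vector, truncation)   # (1, 3, 512, 512) tensor in [-1, 1]

save_as_images(output)   # write the generated image(s) to disk
```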
CLIP
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a wide variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image, without directly optimizing for that task, much like the zero-shot capabilities of GPT-2 and GPT-3. OpenAI reports that CLIP matches the performance of the original ResNet-50 on ImageNet "zero-shot," without using any of the 1.28M labeled ImageNet examples, overcoming several major challenges in computer vision.
CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in its training dataset. This behavior can then be used to turn CLIP into a zero-shot classifier: convert all of a dataset's classes into captions such as "a photo of a dog," and predict the class whose caption CLIP estimates best pairs with a given image.
CLIP learns from unfiltered, highly varied, and highly noisy data ... text–image pairs that are already publicly available on the internet. See details on the CLIP Model Card
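A minimal sketch of the zero-shot procedure described above, using OpenAI's released `clip` package, might look like the following; the image path and candidate captions are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)   # placeholder image
captions = ["a photo of a dog", "a photo of a cat", "a photo of a lighthouse"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)    # similarity of the image to each caption
    probs = logits_per_image.softmax(dim=-1)    # normalize into a "zero-shot" class prediction

print(dict(zip(captions, probs[0].tolist())))
```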
To learn more about CLIP, try the Interacting with CLIP Colab notebook: https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb
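Putting the pieces together: the text-to-image notebook above searches BigGAN's latent space (with CMA-ES) for images whose CLIP embedding best matches a text prompt. The sketch below substitutes plain gradient ascent on the latent vector for CMA-ES so the core idea stays visible; the prompt, ImageNet class index, learning rate, and step count are illustrative placeholders, and it again assumes the `pytorch-pretrained-biggan` and `clip` packages.

```python
import torch
import torch.nn.functional as F
import clip
from pytorch_pretrained_biggan import BigGAN, one_hot_from_int, truncated_noise_sample

device = "cuda" if torch.cuda.is_available() else "cpu"
gan = BigGAN.from_pretrained("biggan-deep-256").to(device).eval()
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()                 # keep weights in fp32 so gradients flow cleanly

# Encode the text prompt once.
tokens = clip.tokenize(["a lighthouse in a storm"]).to(device)
with torch.no_grad():
    text_feat = F.normalize(clip_model.encode_text(tokens), dim=-1)

# A trainable latent vector and a fixed ImageNet class (437 = "beacon, lighthouse").
truncation = 0.4
noise = torch.from_numpy(truncated_noise_sample(truncation=truncation, batch_size=1)).to(device)
noise.requires_grad_(True)
class_vec = torch.from_numpy(one_hot_from_int([437], batch_size=1)).to(device)

# CLIP's image preprocessing statistics.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

opt = torch.optim.Adam([noise], lr=0.05)
for step in range(100):
    image = gan(noise, class_vec, truncation)                    # (1, 3, 256, 256) in [-1, 1]
    image = F.interpolate((image + 1) / 2, size=224, mode="bilinear", align_corners=False)
    image_feat = F.normalize(clip_model.encode_image((image - mean) / std), dim=-1)
    loss = -(image_feat * text_feat).sum()                       # negative cosine similarity
    opt.zero_grad(); loss.backward(); opt.step()
```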