Chris Mair


Text-to-image generation breakthrough

Text-to-image generation

Recently, the machine learning community has seen a significant breakthrough in text-to-image generation.

OpenAI first made the news in 2021 when they announced DALL-E, a system that can generate any image from a text prompt. In April 2022 OpenAI presented an improved DALL-E 2. In early summer 2022 Google showed two research projects capable of text-to-image generation with unprecedented realism: Imagen and Parti.

Unfortunately, all these systems are proprietary and have not been released to the public (DALL-E has an invitation-only demo, though).


Things got interesting this summer when Boris Dayma and Pedro Cuenca reproduced OpenAI’s work at a smaller scale1. They called their system DALL-E Mini, later renamed to Craiyon upon request by OpenAI.

They released everything (the software as well as the trained model) under an Open Source license2. While there is a web-based demo3, computing is nicer when it’s done by oneself, so I had to try this out :).

The repo provides a Jupyter notebook that can easily be run as a Python batch job with a few lines changed. Well… one thing led to another, and I ended up with about 700 images…

Craiyon generates images at a fixed size of 256x256 pixels. It runs fine on CPU only, where it needs a few minutes per image; on a beefy GPU this goes down to a few seconds per image.

Here is Craiyon’s rendering of Linus Torvalds for the prompts “Linus Torvalds”, “Linus Torvalds, crayon painting” and “Linus Torvalds, hugging a penguin”:

Linus Torvalds rendered by Craiyon
Linus Torvalds rendered with Craiyon

(using the latest “mega-1-fp16” model as of late August 2022).

While its results are obviously impressive, Craiyon’s model as of today is a bit limited by the relatively small output size. Also, as you can see here, it struggles quite a bit with human faces and hands. It can paint in any style, though. A trick that often works is asking for a watercolor, crayon or oil painting, or similar: the particular artistic style hides imperfections and rendering artefacts.

Stable Diffusion

Just after Craiyon went viral, a week ago, researchers Rombach, Blattmann, Lorenz, Esser and Ommer released the software and model for their text-to-image system Stable Diffusion (SD)4.

The software is hosted on GitHub5. To try it out locally you need to create an account on Hugging Face and download the model from there (I used the v 1.4 original model6).
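Once the repo is cloned, the environment installed and the weights placed where the scripts expect them, generation boils down to a single command. A minimal sketch of the invocation (flag names and defaults may differ between versions of the repo, so check `python scripts/txt2img.py --help` first):

```shell
python scripts/txt2img.py \
    --prompt "Linus Torvalds, crayon painting" \
    --plms \
    --n_samples 1
```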

I did.

And I was impressed.

Here’s Linus again, same prompts:

Linus Torvalds rendered by Stable Diffusion
SD: "Linus Torvalds"
Linus Torvalds rendered by Stable Diffusion
SD: "Linus Torvalds, crayon painting"
Linus Torvalds rendered by Stable Diffusion
SD: "Linus Torvalds, hugging a penguin"

There are still some problems with the rendering of eyes and fingers in particular (and random penguin feet where they don’t belong). But unless one is deliberately looking for artefacts, at a casual glance these images would pass as real.

Interesting times ahead

The elephant in the room is that people are going to abuse this. The creators of SD were aware of this and built some mitigations into SD. Legally, there is the license: the “CreativeML Open RAIL-M” license is quite permissive, but has anti-abuse clauses. Technically, SD runs a safety check (that triggers way too often, giving you an image of Rick Astley) and adds an invisible watermark to the images, presumably tagging them as fake. I haven’t yet figured out exactly what is written to the watermark; maybe there is some upcoming standard for this? Of course, that’s just a few lines of Python code that anyone can comment out.
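SD’s real watermarking goes through the invisible-watermark package, which hides the tag in the frequency domain. Just to illustrate the general idea that a short tag can be hidden in pixel data without visibly changing the image, here is a toy least-significant-bit sketch in plain Python (this is not SD’s actual scheme, and the tag string is made up):

```python
def embed_lsb(pixels, tag):
    """Hide the bits of `tag` in the least-significant bits of `pixels`.
    pixels: list of 0-255 ints, tag: bytes. Toy illustration only."""
    bits = [(byte >> i) & 1 for byte in tag for i in range(8)]
    assert len(bits) <= len(pixels), "image too small for this tag"
    out = pixels[:]
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit  # overwrite only the lowest bit
    return out

def extract_lsb(pixels, n_bytes):
    """Recover n_bytes hidden by embed_lsb."""
    tag = bytearray()
    for b in range(n_bytes):
        byte = 0
        for i in range(8):
            byte |= (pixels[b * 8 + i] & 1) << i
        tag.append(byte)
    return bytes(tag)

pixels = list(range(200, 232)) * 4         # fake 8-bit image data
marked = embed_lsb(pixels, b"FAKE")        # each pixel changes by at most 1
print(extract_lsb(marked, 4))              # b'FAKE'
```

Because only the lowest bit of each value changes, the marked image is visually identical to the original — which is also why such a mark survives only as long as nobody re-encodes or rescales the image.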

I felt uncomfortable releasing my fake Torvalds images to the web as they were, so I embedded the warning text. Just think what happens once these get into search engines’ indexes, are copied, reused, etc… Now that the genie is out of the bottle this is going to happen anyway, I guess.

There are interesting repercussions for copyright law as well. As far as I understand (don’t quote me on this), the output of systems such as SD or Craiyon is not subject to somebody else’s copyright (at least if you did the computing at your own expense). This is going to revolutionize a whole industry of content creators.

Experimenting with art styles

Unfortunately, and unlike Craiyon, Stable Diffusion does require CUDA, which means it runs only on GPUs from Nvidia. You also need quite a bit of GPU memory. SD can render images at any size (!), but you need increasingly more GPU memory as the size increases.

I spent the weekend experimenting with this and ended up generating about 2000 images on a Tesla V100 with 16 GB of RAM, which takes just a few seconds per image. The 16 GB of RAM were enough for ~ 1024x512 pixels (I didn’t test the exact limits). You can already find tutorials online about hacks to lower memory usage, though.
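The growth with image size is easy to picture: SD denoises in a latent space downsampled by a factor of 8 per dimension (with 4 channels), so the latent tensor grows linearly with pixel count. This little sketch just does that arithmetic — the real memory footprint is dominated by the U-Net’s intermediate activations, so treat the numbers as relative scaling, not as absolute memory use:

```python
def latent_elements(width, height, channels=4, factor=8):
    """Element count of SD's latent tensor for a given output size."""
    return channels * (width // factor) * (height // factor)

base = latent_elements(512, 512)  # SD v1's native resolution
for w, h in [(512, 512), (1024, 512), (1024, 1024)]:
    n = latent_elements(w, h)
    print(f"{w}x{h}: {n} latent elements ({n / base:.0f}x the 512x512 case)")
```

So 1024x512 already doubles the latent (and roughly the activation) size relative to 512x512, which matches the observation that 16 GB tops out around there.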

To impress you with the Dolomites in all their over-saturated glory, here are some results for the prompt “A big anime robot on the Dolomites, summer, bright light, blue sky, highly detailed, national geographic magazine photo.”:

Dolomites and Robots rendered by Stable Diffusion
SD: big anime robots on the Dolomites… 1/4
Dolomites and Robots rendered by Stable Diffusion
SD: big anime robots on the Dolomites… 2/4
Dolomites and Robots rendered by Stable Diffusion
SD: big anime robots on the Dolomites… 3/4
Dolomites and Robots rendered by Stable Diffusion
SD: big anime robots on the Dolomites… 4/4

These absolutely look like professionally made photoshops/renderings, but are generated in seconds from the text prompt alone! Again, these are not just memorized images blended together. As unbelievable as it seems, the SD researchers managed to produce a ~ 4 GB blob of floating point numbers that somehow encodes the visual knowledge of the world.

SD is especially good at copying any style you tell it. Look at my “A kung fu cat standing in front of a japanese temple, finely detailed art, painted by Studio Ghibli.”:

painted kung fu cat rendered by Stable Diffusion
SD: kung fu cat… 1/2

and at my “An epic kung fu cat marble sculpture, highly detailed, standing in front of a bright japanese temple, digital art.”:

marble kung fu cat rendered by Stable Diffusion
SD: kung fu cat… 2/2

One interesting thing about SD is that you can seed its pseudo-random generator and will always get the same image for the same seed (I generated 20 or 30 images starting from seed 1006; the images here are hand-picked from those). Online communities have already started sharing their prompts and seeds. There is absolutely no doubt that illustrators and graphic designers will feel a huge impact from this, for better or for worse.
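This reproducibility is just how pseudo-random generators work: the same seed yields the same number stream, hence the same sampling noise and the same image. The same behaviour in miniature, with Python’s stdlib generator standing in for SD’s sampler:

```python
import random

def fake_sample(seed, n=5):
    """Stand-in for SD's sampler: the 'image' is just n pseudo-random draws."""
    rng = random.Random(seed)  # an independent, explicitly seeded generator
    return [rng.random() for _ in range(n)]

a = fake_sample(1006)
b = fake_sample(1006)
c = fake_sample(1007)
print(a == b)  # True  - same seed, same "image"
print(a == c)  # False - different seed, different "image"
```

This is why a shared (prompt, seed) pair is enough for someone else to regenerate the exact same picture, at least on the same model version and settings.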

Finally, you can also seed SD with an image, and it will work on its prompt starting from your image. So there’s more experimenting to do…

Big G is still ahead, but not much?

Google’s Parti team published a few reference images at different snapshots of the learning progress of their net. Follow the link7 and look for the prompt “A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!”

This is what I got from SD for the same prompt:

kangaroo rendered by Stable Diffusion
SD: kangaroo…

Apart from the lost sunglasses, you can see SD is still learning to write (this is the best of a few dozen images), but it’s getting really close to the state of the art… Of course, the fact that we know little about kangaroo anatomy helps keep the illusion :)