Image generation

Building simple generative AI models

Examples of images generated by one of the models (DCGAN 4) I have built in this project, using the MNIST dataset for training

Why this project?

Motivations

Since it started making headlines in the late 2010s, I have been fascinated by generative AI: the idea that we can instruct computers to create realistic content (be it images, video, audio, text or anything else) is really mind-blowing. The extremely fast pace at which this field is developing is also exciting, and although I am worried (like anybody else) about the potential malicious uses of generative AI, I’m really curious to see how this tool will be used in the future and how it will change our society (hopefully only for the better!).

Since I am a naturally curious person, I wanted to go a little bit deeper and try to understand first-hand how generative AI works in practice, and write my own code to do this. That’s the reason why I embarked on this project, where I built a few simple models for image classification and generation.

Important notes

  1. Generative AI models are usually very computationally expensive, so building any model that deals with real-world data requires an immense amount of memory and computational power; this is also why models like ChatGPT, Claude or Midjourney require the structure and the resources of entire companies to be maintained. For this reason, since this project was performed on my personal laptop, its scope and the type of data used were naturally limited by the computational power available to me at the time.

  2. Since in this project I build and test several different model architectures for image classification and generation, it is a bit more technical than others and the language used here reflects that. To keep things clear without becoming too cumbersome, whenever possible I will link to other online resources that explain the concepts used here; there are plenty of useful resources online and it wouldn’t make sense to rewrite explanations that are already out there.

Aims

The aim of this project is to showcase that I am capable of building deep learning (classification) and generative models for images, and to study the effect that their architecture has on their performance. In particular, for classification models I’ll be monitoring their accuracy and training time; for generative models I’ll be monitoring how well they can generate realistic images and how long it takes to train them.

Skills that I’ve honed with this project:

  • Development of Machine Learning models (scikit-learn)
  • Development of generative AI models (TensorFlow)
  • Model performance evaluation, monitoring and optimization for image generation

Structure of the project

Code availability

The code I wrote for the entire project can be found here.

The code I wrote for this project is shared only as a demonstration of my programming skills, problem-solving approach, and technical implementation abilities.

Methods and results

As explained above, I had limited computational power available when working on this project, so I used a lightweight dataset that makes it possible to work on image classification and (most importantly) image generation on a personal laptop: the MNIST dataset. It consists of 70,000 images (of size 28x28 pixels) of hand-written digits, generally split 60,000/10,000 for training and testing.
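For reference, this is roughly how the MNIST data can be loaded and prepared with Keras (a minimal sketch, not necessarily the exact code I used in the project):

```python
import tensorflow as tf

# Load the MNIST dataset bundled with Keras: 60,000 training and 10,000 test images
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Scale pixel values to [0, 1] and add a channel dimension for the convolutional layers
x_train = x_train.astype("float32")[..., None] / 255.0
x_test = x_test.astype("float32")[..., None] / 255.0

print(x_train.shape, x_test.shape)  # (60000, 28, 28, 1) (10000, 28, 28, 1)
```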

To get an idea of what the data looks like, let’s take a look at examples of each digit in the dataset:

Results in a nutshell

Spoiler alert!


Image classification

In this first part of the project I use a few different types of neural network for image classification, and see how changing their architecture affects their accuracy and training time.

Convolutional neural networks

I start with convolutional neural networks (CNNs), which in general are the most popular solution for image classification. The choices for the architectures were made so that I could test the effect of adding more convolutional or fully connected layers on the same baseline (CNN 1). Also, in order to compare the models under the same conditions, I’m always using 100 epochs for training.

CNN 1

I start with a CNN made up of three convolutional layers and one fully connected layer (open any of the images below in a new tab/window for more detail if they look too small):
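As a rough illustration, a Keras model along these lines could look like the sketch below; the filter counts, kernel sizes and pooling layers are my own assumptions for the example, not necessarily the exact values used in the project (x_train, y_train, x_test, y_test are assumed to be prepared as in the loading snippet above):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical CNN with three convolutional layers and one fully connected
# (hidden) layer before the 10-way softmax output
cnn_1 = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),     # the single fully connected layer
    layers.Dense(10, activation="softmax"),  # output layer for the 10 digits
])

cnn_1.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# All classification models are trained for 100 epochs, as described above
history = cnn_1.fit(x_train, y_train, epochs=100,
                    validation_data=(x_test, y_test))
```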

This is the training history of this model (i.e., how the model’s accuracy and loss change across training epochs):

As we can see, even this simple network reaches a very high accuracy (~98%), but it slightly overfits after epoch ~20.

CNN 2

Let’s now see what happens if we use five convolutional layers and one fully connected layer:

Here is the training history in this case:

The accuracy reached by the network is similar to the previous case, but the overfitting is definitely less pronounced.

CNN 3

Let us now see what happens if we use three convolutional layers and three fully connected layers:

The training history now looks like this:

Again, the accuracy of the model seems comparable to the previous ones, but now the overfitting (which still happens at epoch ~20) is more marked.

Comparing CNNs

As we can see, the behavior of all three models is pretty similar, and all three eventually start overfitting.

If we plot the models’ accuracies together, we see that they basically perform identically, since they have the same accuracy of ~98%¹:

The best-performing model, technically, is CNN 2, with an accuracy of ~98.6%. However, this is only ~0.6% higher than that of the “worst”-performing model (CNN 3).

On the other hand, using more complex model architectures (especially additional convolutional layers) leads to significantly longer training times:

The conclusion from this section of the project, therefore, is that increasing the model’s complexity is not worthwhile in this case, because it leads to significantly longer training times and only marginally better accuracy. Sometimes less is more!

Fully convolutional neural networks

Let’s now see what happens if we classify images with fully convolutional neural networks (FCNNs), which are generally used for image segmentation. These models have the same structure as standard CNNs, with the difference that all layers are convolutional: instead of having fully connected layers at the end, they have global average pooling layers.

FCNN 1

Let’s start with a FCNN made of three convolutional layers and one pooling layer:
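A sketch of what such a fully convolutional classifier could look like in Keras is shown below (the filter counts and kernel sizes are illustrative assumptions on my part): the last convolutional layer has one filter per class, and global average pooling replaces the fully connected head.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical FCNN: three convolutional layers, one pooling layer, no Dense layers at all
fcnn_1 = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.Conv2D(64, 3, activation="relu"),
    layers.Conv2D(10, 3),             # 10 filters, one per digit class
    layers.GlobalAveragePooling2D(),  # the single pooling layer, instead of a Dense head
    layers.Activation("softmax"),
])
```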

This is the training history of this model:

This network not only converges more slowly than “classic” CNNs, but also has a much lower accuracy. Let’s see if we can improve this by changing the network’s architecture.

FCNN 2

Let’s see what happens if we use a FCNN made up of five convolutional layers and one pooling layer:

Let’s look at the training history of this model:

This simple modification gave us a much higher accuracy! Also, the convergence of this model is more similar to that of CNNs.

FCNN 3

Let’s now use a FCNN with three convolutional layers (like FCNN 1) but with larger kernel sizes:

The accuracy is even higher now, although there is also slight overfitting after epoch ~40.

FCNN 4

Finally, let’s use a FCNN with five convolutional layers (like FCNN 2) but with larger kernel sizes:

The accuracy is still high but not as high as for FCNN 3, and this time the model overfits after epoch ~20.

Comparing FCNNs

This time we can clearly see that the network’s architecture strongly influences its performance:

Different architectures lead to different accuracies, and again a more complex model is not always the best choice.

There is also a stark difference in training times between the different models:

As we can see, adding convolutional layers has a large cost in terms of training time, in contrast to merely increasing the kernel size. In fact, as we can see by comparing FCNN 2 and FCNN 4, increasing the kernel size can actually bring down the training time: larger kernels mean that the layers’ outputs are smaller, and therefore less memory and computational power is required for training.
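To make the last point concrete, here is a tiny illustration of the effect (assuming "valid" padding, i.e. no zero-padding, which is what shrinks the output):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros((1, 28, 28, 1))

# With "valid" padding, a larger kernel shrinks the spatial output more,
# so the following layers have fewer activations to process
print(layers.Conv2D(32, kernel_size=3)(x).shape)  # (1, 26, 26, 32)
print(layers.Conv2D(32, kernel_size=7)(x).shape)  # (1, 22, 22, 32)
```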

Comparing CNNs and FCNNs

We can also make a scatter plot of the training time vs. accuracy of all seven models analyzed above to see which one would be the “best” (i.e., the one with the highest accuracy and shortest training time) for this case of image classification:

As we can see, the “best” model in this case is CNN 1, i.e. the simplest CNN architecture we tested. Less is really more: depending on the task at hand, building a more complex model is not always the right answer.

Image generation

In this second part of the project I build a few different types of neural network for image generation, and I will change the models’ architecture to see how that affects their training time and their ability to generate realistic new images.

Convolutional variational autoencoders

The first class of models for image generation that I start with is convolutional variational autoencoders (CVAEs). These are basically the same as classic variational autoencoders, with the difference that they use convolutional layers instead of only fully connected layers.

CVAE 1

Let’s start with a CVAE that has three convolutional layers and one fully connected layer in both encoder and decoder, and with a hidden layer of dimension 100 and latent space of dimension 2:
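To make the architecture more concrete, here is a minimal sketch of a CVAE along these lines; the filter counts, strides and the extra projection layer in the decoder are my own assumptions for the example, not the exact implementation used in the project:

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 2    # latent space of dimension 2
hidden_dim = 100  # fully connected "hidden" layer before the latent space

class Sampling(layers.Layer):
    """Reparameterization trick: z = z_mean + exp(z_log_var / 2) * epsilon."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# Encoder: three convolutional layers + one fully connected layer
enc_in = layers.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(enc_in)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.Flatten()(x)
x = layers.Dense(hidden_dim, activation="relu")(x)
z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)
z = Sampling()([z_mean, z_log_var])
encoder = tf.keras.Model(enc_in, [z_mean, z_log_var, z], name="encoder")

# Decoder: one fully connected layer + three (transposed) convolutional layers
dec_in = layers.Input(shape=(latent_dim,))
y = layers.Dense(hidden_dim, activation="relu")(dec_in)
y = layers.Dense(7 * 7 * 64, activation="relu")(y)  # projection needed before reshaping
y = layers.Reshape((7, 7, 64))(y)
y = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(y)
y = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(y)
dec_out = layers.Conv2DTranspose(1, 3, padding="same", activation="sigmoid")(y)
decoder = tf.keras.Model(dec_in, dec_out, name="decoder")

# Training minimizes the reconstruction loss (e.g. binary cross-entropy between the
# input and decoder(z)) plus the K-L divergence between the latent distribution and
# a standard normal, typically via a custom train_step.
```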

Let’s look at the training history of this model:

The model is generally behaving as expected (i.e., K-L loss going up, reconstruction loss going down, both eventually stabilizing), although the model is converging rather slowly.

To evaluate this model, we can look at how the reconstructed digits are separated in the latent space:
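For reference, a plot like this can be produced by encoding the test images and coloring each point by its true label (a sketch, assuming the encoder model defined in the snippet above):

```python
import matplotlib.pyplot as plt

# Encode the test set and scatter the latent coordinates, colored by true digit
z_mean, _, _ = encoder.predict(x_test)
plt.figure(figsize=(8, 6))
plt.scatter(z_mean[:, 0], z_mean[:, 1], c=y_test, cmap="tab10", s=2)
plt.colorbar(label="digit")
plt.xlabel("latent dimension 1")
plt.ylabel("latent dimension 2")
plt.show()
```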

From this plot we can see that this model reconstructs 0s, 1s, 2s, and 6s well, while the other digits are not well separated. Let’s take a look at the reconstructed images across the latent space:

Our initial hunch was correct. We can also distinguish some other digits (e.g., 3s and 9s), but in general this model cannot reproduce all 10 digits.

CVAE 2

Let’s now see what happens if we keep the same overall architecture as CVAE 1 but we increase the hidden layer dimension (i.e., the size of the fully connected layer before the latent space representation) to 200:

The situation is qualitatively the same.

CVAE 3

Let us now go back to the original hidden layer dimension (i.e., 100), but let’s also add two convolutional layers in both encoder and decoder:

There is some improvement (e.g., we can see 5s now), but it’s not drastic.

CVAE 4

Let us try with a network that has three convolutional layers but now also three fully connected layers in both encoder and decoder:

This looks even a bit better, but again the model is not capable of fully separating all ten digits.

CVAE 5

Finally, let’s try adding one additional fully connected layer to both encoder and decoder:

The situation has not changed drastically.

It looks like in general CVAEs might not be the best choice to generate images in this case.

Comparing CVAEs

To compare the performance of these models we can compute the silhouette scores² for all data points shown above. This can give us a quantitative sense of how well these models are separating the digits in the latent space:
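A sketch of how such a score can be computed with scikit-learn, using the latent coordinates of the test images and the true digit labels as cluster assignments (again assuming the encoder of each CVAE is available, as in the snippets above):

```python
from sklearn.metrics import silhouette_score

# Encode the test set and score how well the true digit classes are separated
# in the 2-dimensional latent space
z_mean, _, _ = encoder.predict(x_test)
score = silhouette_score(z_mean, y_test)
print(f"silhouette score: {score:.3f}")
```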

All models perform similarly poorly overall.

To conclude this section, let’s take a look at their training times:

As we can see, making the model more complex has a cost in terms of training time, and again adding additional convolutional layers has a much higher cost compared to adding fully connected layers.

Deep convolutional generative adversarial networks

Generative adversarial networks (GANs) are a popular choice for generative AI applications. In this section I develop a few deep convolutional generative adversarial networks (DCGANs) to generate images of hand-written digits.

DCGAN 1

Let’s start by building a DCGAN with three convolutional layers in both the generator and discriminator, and using a tanh activation function in the last layer of the generator (as this is a very popular choice with DCGANs):
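A minimal sketch of what such a DCGAN could look like in Keras is shown below; the latent (noise) dimension, filter counts, strides and the use of batch normalization and dropout are illustrative assumptions on my part, not necessarily the project’s exact choices:

```python
import tensorflow as tf
from tensorflow.keras import layers

noise_dim = 100  # dimension of the random noise fed to the generator

# Generator: projects the noise to a small feature map, then upsamples with
# three transposed convolutional layers; the last one uses tanh, as described above
generator = tf.keras.Sequential([
    layers.Input(shape=(noise_dim,)),
    layers.Dense(7 * 7 * 128),
    layers.Reshape((7, 7, 128)),
    layers.Conv2DTranspose(128, 5, strides=1, padding="same"),
    layers.BatchNormalization(),
    layers.LeakyReLU(),
    layers.Conv2DTranspose(64, 5, strides=2, padding="same"),
    layers.BatchNormalization(),
    layers.LeakyReLU(),
    layers.Conv2DTranspose(1, 5, strides=2, padding="same", activation="tanh"),
])

# Discriminator: three convolutional layers followed by a single real/fake logit
discriminator = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(64, 5, strides=2, padding="same"),
    layers.LeakyReLU(),
    layers.Dropout(0.3),
    layers.Conv2D(128, 5, strides=2, padding="same"),
    layers.LeakyReLU(),
    layers.Dropout(0.3),
    layers.Conv2D(128, 5, strides=2, padding="same"),
    layers.LeakyReLU(),
    layers.Flatten(),
    layers.Dense(1),
])
```

The two networks are then trained adversarially: the discriminator learns to tell real digits from generated ones, while the generator learns to fool it.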

Let’s look at the training history of this model:

As we can see, after an initial period of quick learning, the network converges slowly. We can expect that the images generated by this network will not look good:

Indeed, the generator is not doing a good job and can only generate barely passable 0s and 9s.

DCGAN 2

Let’s see now what happens if we keep the same network architecture but change the activation function at the end of the generator to a sigmoid.
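One practical detail worth noting here (an assumption on my part about the data preparation, since it is not spelled out above): the generator’s output activation usually has to match the range to which the real training images are scaled, e.g.:

```python
import tensorflow as tf

(x_train, _), _ = tf.keras.datasets.mnist.load_data()  # raw pixel values in [0, 255]

# The real images should live in the same range as the generator's output
x_for_tanh = (x_train.astype("float32") - 127.5) / 127.5  # [-1, 1], for a tanh output
x_for_sigmoid = x_train.astype("float32") / 255.0         # [0, 1], for a sigmoid output
```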

This is the training history of the model in this case:

We can immediately see that the behavior of the network is different, and that after some oscillations (albeit small ones: notice how much tighter the y-axis is in this case compared to the previous one) both the generator and discriminator have converged; this is the type of behavior that GANs generally exhibit.

Let’s look at the images created by the generator in this case:

We can definitely see an improvement in the network’s performance, which came by simply changing the activation function in the very last layer from tanh to sigmoid. However, most of the generated images still look like squiggles.

DCGAN 3

Let’s keep building on this model, and increase the kernel sizes in the convolutional layers:

The behavior of the network during training is qualitatively the same as before:

The performance of the generator, however, is markedly improved:

It looks like there is still some room for improvement, though.

DCGAN 4

Finally, let’s try adding one additional convolutional layer in both the generator and discriminator:

In this case the network seems to be converging much faster:

And here are the generated images:

Even though some images still look like squiggles (which could still be due to the fact that images are generated by randomly sampling the latent space), the generator is now able to create realistic images of all ten digits (i.e., for each digit we can see at least one realistic image being generated)!
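For reference, this is roughly how the new digits are produced, by randomly sampling the latent space and passing the noise vectors through the generator (a sketch, assuming the generator and noise_dim defined in the DCGAN snippet above):

```python
import tensorflow as tf
import matplotlib.pyplot as plt

# Sample 16 random points in the latent space and generate one image for each
noise = tf.random.normal((16, noise_dim))
generated = generator(noise, training=False).numpy()

fig, axes = plt.subplots(4, 4, figsize=(4, 4))
for img, ax in zip(generated, axes.flat):
    ax.imshow(img[..., 0], cmap="gray")
    ax.axis("off")
plt.show()
```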

Comparing DCGANs

Let’s see how the different models compare in terms of training times:

As we can see, increasing the complexity of the model increases the training time, with a much higher cost for adding convolutional layers compared to increasing the size of the kernels.

However, given how much better DCGAN 4 performs against all the other models, in this case the substantially longer training time is totally worth it.

Conclusions

There are two major conclusions we can draw from this project, which are general lessons for dealing with deep learning and generative AI:

  • For image classification, using more complex models is not always the right answer: simpler models can reach the same accuracy with much shorter training times
  • For image generation, the network’s performance is mostly determined by its architecture, but in order to get realistic images there is a lot of fine-tuning involved

Footnotes

  1. We can reach such a high accuracy because the data is relatively simple. With more “real-world” data, the accuracy might be significantly lower. Again, we are using this simple dataset because it makes it possible to build generative models that can run easily on a personal laptop.

  2. The silhouette score goes from -1 to 1, with 1 meaning perfect separation (clusters are well defined and far away from each other) and -1 meaning no separation.