Gotta Catch ’em All: Recognising Pokémon Types with AI

When I was a little kid, just like all of my peers, I was genuinely obsessed with Pocket Monsters – Pokémon, for short. One of the many qualities of those creatures that mesmerised me was how many different types of them were out there. To name just a few, we had Electric Pokémon, Fire Pokémon, Ghost Pokémon, and so on. And that was many years ago! Since then, Pokémon have gained some new classes, and right now there are as many as 18 of them!

So, just as I’d been wondering in my childhood, lately I’ve been thinking about what exactly makes different types of Pokémon… well, different. And most importantly, is there a way to recognise them based just on their looks?

And that’s why I’ve decided to check it out!

After some brief research, I discovered that I wasn’t the only one who had this idea (which is not surprising, considering how much of a world-class phenomenon those creatures are).

My primary source of inspiration for the journey ahead was an article published in the Journal of Geek Studies by Henrique M. Soares in 2017. In his piece, Mr Soares set himself a task similar to mine – with a few differences.

First, he focused only on how Pokémon look in their series of games. Secondly, his model’s accuracy in recognising the type of a Pokémon based on its looks was below 40%.

The first point forced Mr Soares to use a small number of images, each with a relatively low resolution. Those images also didn’t vary much between themselves, since they all came from a similar source. In the real world, as with any other kind of graphics, we can observe many different styles of artwork, with varying attention to detail.

As for the second point, well… it was straightforward – I wondered if I could improve his result, especially using more of a “real-world” kind of data.

So, here my journey began!

Data Collection

The first step of every data analysis process looks the same – collect the data. To gather inputs for my mini-project, I decided to combine three different sources.

The first was a dataset I obtained from Kaggle, containing nearly 10,000 samples of the first 151 Pokémon. The second, also from Kaggle, was a dataset containing 819 images covering every currently known Pokémon, one image per Pokémon.

To add some more samples to my set, the third source was built from random images found with the Bing search engine.

To collect images from Bing, I used a tutorial from PyImageSearch. Bing provides a simple-to-use API, which is free for the first seven days – more than enough for my project.
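For the curious, here is a minimal sketch of what that scraping boils down to. The key, query and output folder below are placeholders, and the endpoint is the v7 one the tutorial uses; the full script, with pagination and better error handling, is on PyImageSearch.

import os
import requests

# Placeholder key, query and output folder -- substitute your own
API_KEY = "YOUR_BING_API_KEY"
URL = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
OUTPUT = "./Pokemon/bing/fire"

headers = {"Ocp-Apim-Subscription-Key": API_KEY}
params = {"q": "fire pokemon", "offset": 0, "count": 50}

# Ask Bing for image metadata, then download each result
search = requests.get(URL, headers=headers, params=params)
search.raise_for_status()
os.makedirs(OUTPUT, exist_ok=True)

for i, item in enumerate(search.json()["value"]):
    try:
        image = requests.get(item["contentUrl"], timeout=30)
        if image.status_code == 200:
            with open(os.path.join(OUTPUT, "{:05d}.jpg".format(i)), "wb") as f:
                f.write(image.content)
    except requests.exceptions.RequestException:
        continue  # skip dead links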

Summing up, as a result we got three different directories, each with a different structure. In contrast to the last source, built entirely by me, in which there is a separate folder for each Pokémon type, the two Kaggle datasets have directories named after Pokémon – not their types. We have to change that to proceed with our work.

Data Preprocessing

First, we have to know precisely what type each Pokémon has. As a source of this information, I used yet another Kaggle dataset, containing the stats of each of the 721 Pokémon. Now we have to categorise our data using this information, which can be quickly done in Python.

Let’s do this for the first of our Kaggle datasets:

import shutil
import os
import csv

file = "./Pokemon/data/pokemon_types_names.csv"
with open(file,'r') as f: 
        reader = csv.reader(f)
        next(reader,None) #Skip the header
        pkm_dict1 = {rows[1]:rows[2] for rows in reader}

for pkm_name, pkm_type in pkm_dict1.items():
            source = './Pokemon/pokemon-generation-one/{pkm_name}/'.format(pkm_name=pkm_name) 
            dest = './Pokemon/dataset/{pkm_type}'.format(pkm_type=pkm_type)
            for f in os.listdir(source):
                if not os.path.exists(dest):
                    os.makedirs(dest)
                shutil.copy(source+f, dest)

And now for the second one:

import shutil
import os
import csv

file = "./Pokemon/data/pokemon_types.csv"
with open(file,'r') as f: 
        reader = csv.reader(f)
        next(reader,None) #Skip the header
        pkm_dict2 = {rows[0]:rows[1] for rows in reader}

for pkm_no, pkm_type in pkm_dict2.items():
            source = './Pokemon/pokemon-images-dataset/{pkm_no}.png'.format(pkm_no=pkm_no) 
            dest = './Pokemon/dataset/{pkm_type}'.format(pkm_type=pkm_type)
            if not os.path.exists(dest):
                os.makedirs(dest)
            shutil.copy(source, dest)

After combining every type-related directory, we have a dataset to work with. Let’s take a look at its structure.

dataset
 ├── bug (907 items)
 ├── dark (132 items)
 ├── dragon (311 items)
 ├── electric (930 items)
 ├── fairy (251 items)
 ├── fighting (564 items)
 ├── fire (1130 items)
 ├── flying (113 items)
 ├── ghost (327 items)
 ├── grass (1138 items)
 ├── ground (654 items)
 ├── ice (267 items)
 ├── normal (1695 items)
 ├── poison (1004 items)
 ├── psychic (836 items)
 ├── rock (683 items)
 ├── steel (154 items)
 └── water (2282 items)

Here we can spot something that may become a problem for us later: our dataset is quite imbalanced. What could be the origin of this issue? To investigate, we can take a look at the distribution of the Pokémon themselves.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

data = pd.read_csv('./Pokemon/data/pokemon_types_source.csv', sep=',', index_col=0)

sns.set(style="darkgrid")
ax = sns.countplot(x="Type_1", data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()

As we can see, Pokémon types themselves are not balanced. Some types are more popular (e.g. the Water type), and some less so (e.g. the Steel type). What’s more, some types are usually not the primary ones (e.g. there are plenty of Pokémon with Flying as their secondary type, but not many with Flying as their primary one).

That could become a problem: as shown by Buda et al., an imbalanced dataset can significantly decrease the accuracy of a convolutional neural network. We could try to resolve this issue artificially, for example by using a generative adversarial network or an oversampling technique (like SMOTE-NC), but to keep this project simple, we will try to improve the balance of our dataset only by applying data augmentation.
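To make that concrete, here is a minimal sketch of the kind of Keras generator I mean – it produces randomly rotated, shifted, sheared, zoomed and flipped variants of the training images on the fly. The exact values below are fairly standard defaults, not necessarily the ones I ended up with (those are in the repository).

from keras.preprocessing.image import ImageDataGenerator

# Randomly perturb the training images so each epoch sees slightly
# different versions of the same (scarce) samples
aug = ImageDataGenerator(rotation_range=25, width_shift_range=0.1,
                         height_shift_range=0.1, shear_range=0.2,
                         zoom_range=0.2, horizontal_flip=True,
                         fill_mode="nearest")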

Neural Network Training

Choosing the right network type

To train our neural network to recognise Pokémon types, we will use a variation of VGGNet. VGGNets originated in 2014, and they usually achieve excellent results in image recognition. However, while their structure is rather simple (they use only 3×3 convolutions and 2×2 pooling), they are also very slow to train (due to their large number of weight layers). To overcome this issue, we will build a smaller version of VGGNet, as proposed on PyImageSearch, with some changes to its architecture so it can handle our dataset.

First, to handle the complexity of our dataset and our need for a richer set of features, we will add another block to the network. It will be built from another set of convolution and activation layers stacked on each other, with the filter count increased to 256 (to reduce possible overfitting).

Another thing we will change is the size of the images the network trains on. Larger images usually mean a richer set of features for the network to use. However, they also increase the computational cost of training the model, which proved to be a severe obstacle, as I’m training my network on a CPU, on my own computer. That’s why I increased the image size only to 112×112 (from the original 96×96), as that was the most my poor laptop could handle.

# import the necessary packages
from keras.models import Sequential
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Activation
from keras.layers.core import Flatten
from keras.layers.core import Dropout
from keras.layers.core import Dense
from keras import backend as K

class SmallerVGGNet:
	@staticmethod
	def build(width, height, depth, classes):
		# initialize the model along with the input shape to be
		# "channels last" and the channels dimension itself
		model = Sequential()
		inputShape = (height, width, depth)
		chanDim = -1

		# if we are using "channels first", update the input shape
		# and channels dimension
		if K.image_data_format() == "channels_first":
			inputShape = (depth, height, width)
			chanDim = 1

		# CONV => RELU => POOL
		model.add(Conv2D(32, (3, 3), padding="same",
			input_shape=inputShape))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(MaxPooling2D(pool_size=(3, 3)))
		model.add(Dropout(0.25))

		# (CONV => RELU) * 2 => POOL
		model.add(Conv2D(64, (3, 3), padding="same"))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(Conv2D(64, (3, 3), padding="same"))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(MaxPooling2D(pool_size=(2, 2)))
		model.add(Dropout(0.25))

		# (CONV => RELU) * 2 => POOL
		model.add(Conv2D(128, (3, 3), padding="same"))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(Conv2D(128, (3, 3), padding="same"))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(MaxPooling2D(pool_size=(2, 2)))
		model.add(Dropout(0.25))

		# (CONV => RELU) * 2 => POOL
		model.add(Conv2D(256, (3, 3), padding="same"))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(Conv2D(256, (3, 3), padding="same"))
		model.add(Activation("relu"))
		model.add(BatchNormalization(axis=chanDim))
		model.add(MaxPooling2D(pool_size=(2, 2)))
		model.add(Dropout(0.25))

		# first (and only) set of FC => RELU layers
		model.add(Flatten())
		model.add(Dense(1024))
		model.add(Activation("relu"))
		model.add(BatchNormalization())
		model.add(Dropout(0.5))

		# softmax classifier
		model.add(Dense(classes))
		model.add(Activation("softmax"))

		# return the constructed network architecture
		return model
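
With the class defined, building the network for our problem comes down to a single call – 112×112 RGB inputs and one output per type:

# 112x112 RGB images, 18 Pokémon types
model = SmallerVGGNet.build(width=112, height=112, depth=3, classes=18)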


Now, let’s jump to a more technical part for a bit (feel free to skip it if you’re more interested in the results themselves).

Tuning the network

In our network, we had to make a few choices to allow it to train properly. Besides those described above, there are a few others:

Choice of an activation function: as in most cases with CNNs, we mostly care whether an image belongs to some class or not. We also don’t have much use for mapping negative values (when the image doesn’t belong to some class – our biggest concern is the positive activations only). That’s why we can quickly pick the rectified linear unit (ReLU) as our activation function. Of course, it’s not differentiable at zero, but as we said, in our small project we care mainly about the positive values. Otherwise, we could use some variation of ReLU (leaky ReLU, for example), but let’s leave that for another occasion.

Choice of an optimizer: in CNNs, the default choice of optimizer is often Adam. It outperforms standard stochastic gradient descent and many other optimizers, mostly thanks to its effectiveness (achieved by continuous bias correction) and its efficiency when it comes to memory requirements. It was introduced in 2015, and I will include a link to the original paper for further reference.

Choice of a loss function: our goal is to measure the performance of our model. As we have a classification problem on our hands, and its output will be the probability that an image belongs to a given class (a value between 0 and 1), we will use a cross-entropy function. Because our problem has more than two classes (eighteen of them, to be precise), we will use categorical cross-entropy (instead of the binary one).

As for the data augmentation parameters, the number of epochs and the batch size, we will stick to a more or less standard approach, which you can see in the sketch below and in the code itself (I will link to the whole repository at the end).
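Putting these choices together, the training setup looks more or less like this. To be clear about what is assumed here: EPOCHS, INIT_LR and BS below are standard PyImageSearch defaults rather than my exact values, and trainX/trainY/testX/testY stand for the train/test split of our images and one-hot-encoded labels (the real values and the split itself live in the repository).

from keras.optimizers import Adam

EPOCHS = 100   # assumed defaults -- see the repository for the exact values
INIT_LR = 1e-3
BS = 32

# Adam with linear learning-rate decay, categorical cross-entropy
# as the loss for our 18-class problem
opt = Adam(lr=INIT_LR, decay=INIT_LR / EPOCHS)
model.compile(loss="categorical_crossentropy", optimizer=opt,
              metrics=["accuracy"])

# Train on augmented batches produced by the generator defined earlier;
# trainX/trainY and testX/testY are the (hypothetical) split arrays
H = model.fit_generator(
    aug.flow(trainX, trainY, batch_size=BS),
    validation_data=(testX, testY),
    steps_per_epoch=len(trainX) // BS,
    epochs=EPOCHS, verbose=1)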

So, we have our data, we have our network, and we have our architecture. It’s time to (finally!) train our model. In my case, this part took roughly 30 hours (I told you my laptop has its limits…). Let’s see how it managed.

As we can see, the results are far from perfect. My model achieved at best 68% accuracy – not that good, but still better than the results obtained by Mr Soares. Let’s take a look at some random examples from our classification and try to find the potential reasons behind this low score.
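(For reference, each example below was classified with a snippet roughly like this one; the model and label paths are placeholders for the files produced by the training script.)

import cv2
import pickle
import numpy as np
from keras.models import load_model
from keras.preprocessing.image import img_to_array

# Load the trained model and the pickled label binarizer
# (placeholder paths -- see the repository)
model = load_model("./Pokemon/pokemon_types.model")
lb = pickle.loads(open("./Pokemon/labels.pickle", "rb").read())

# Preprocess a single image the same way as the training data
# (resize to 112x112 and scale pixel values to [0, 1])
image = cv2.imread("./Pokemon/examples/togepi.png")
image = cv2.resize(image, (112, 112)).astype("float") / 255.0
image = np.expand_dims(img_to_array(image), axis=0)

# The predicted type is the class with the highest probability
proba = model.predict(image)[0]
print(lb.classes_[np.argmax(proba)])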


As we can see, our assumption (made at the beginning of this article) was probably correct. Our model had no problem recognising images from the most popular classes (such as Fire or Water), but it struggled with types that had fewer samples in our data (like Steel, Dark or Fairy).

The Fairy class in particular can be seen as problematic – it was created quite recently (in 2013, in the 6th generation of Pokémon games), and some of the Pokémon which now belong to it (like Togepi, my favourite childhood Pokémon, shown in the image above) were previously treated as Normal type. That’s why, in the process of obtaining data, we could easily pick up samples with what is now a wrong label, even though it was correct some time ago.

Other things that could improve the accuracy score would be, as stated previously, training on higher-resolution images, adding further layers to our VGGNet or – the most likely factor – getting a more balanced dataset, keeping a rule of thumb of at least ~1,000 examples per class.

Conclusion

When I’ve first had an idea that I could build a neural network which would be capable of recognising (even with flaws) different types of Pokémon, I thought it to be an unreal one. For one, I’ve never before had anything to do neither with convolutional neural networks and nor with computer vision. But here we are. With inspiration and help from a lot of great people, whose works I’ve mentioned earlier in the text, I was able to have a lot of fun & to learn a great deal of information on the way. 

Is my model perfect? No. But it was never supposed to be. I have a lot to learn, and I hope that in time I will be able to build a solution that recognises Pokémon types with ease. Creating artificial samples of image categories in particular (as has been done with GANs in the This Person Does Not Exist project) looks promising.

So, I hope this text gives somebody at least a small part of the fun I had writing it, and hopefully it will inspire you to dream big in whatever your area of interest is.

Have a great day and until the next one!


P.S. The whole code and all files related to this project can be found in a repository here. The data, however, can be downloaded from here.