Recently I’ve been reading, testing and writing a bit about edge computing (like here, and here). Personally, my main focus is edge AI. With cool new hardware hitting the shelves recently, I was eager to compare the performance of the new platforms and even test them against high-performance systems.
The main devices I’m interested in are the new NVIDIA Jetson Nano (128 CUDA cores) and the Google Coral Edge TPU (USB Accelerator). I will also be testing an i7-7700K + GTX1080 (2560 CUDA cores), a Raspberry Pi 3B+ and my own old workhorse, a 2014 MacBook Pro containing an i7-4870HQ (without CUDA-enabled cores).
I will be using MobileNetV2 as a classifier, pre-trained on the ImageNet dataset. I use this model straight from Keras, with a TensorFlow backend: the floating point weights for the GPUs, and an 8-bit quantised tflite version for the CPUs and the Coral Edge TPU. (If it is unclear why I don’t use an 8-bit model for the GPUs, keep on reading, I will talk about this.)
First, the model and an image of a magpie are loaded. Then we execute one prediction as a warm-up (because I noticed the first prediction was always a lot slower than the next ones) and let the script sleep for 1s, so that all threads are certainly finished. Then the script goes for it and does 250 classifications of that same image. By using the same image for all classifications, we ensure that the data stays close to the CPU throughout the test. After all, we are interested in inference speeds, not the ability to load random data faster.
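The test procedure described above can be sketched roughly as follows. This is a minimal sketch, not the actual script: the names `benchmark`, `dummy_predict` and `settle_seconds` are made up for illustration, and `dummy_predict` is only a stand-in so the snippet runs without TensorFlow. In a real run you would pass something like the Keras model's `predict` method and the preprocessed magpie image instead.

```python
import time


def benchmark(predict, sample, n_runs=250, settle_seconds=1.0):
    """Time n_runs calls of predict(sample) after a single warm-up call."""
    predict(sample)             # warm-up: the first prediction is much slower
    time.sleep(settle_seconds)  # give background threads time to finish
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(sample)         # same input every time, so data stays cached
    elapsed = time.perf_counter() - start
    return {
        "total_s": elapsed,
        "ms_per_inference": 1000 * elapsed / n_runs,
        "fps": n_runs / elapsed,
    }


def dummy_predict(x):
    """Hypothetical stand-in for a model's predict call."""
    return sum(x)


# settle_seconds=0.0 only to keep this demo quick; the text uses 1s.
stats = benchmark(dummy_predict, [0.0] * 1000, n_runs=250, settle_seconds=0.0)
print(f"{stats['ms_per_inference']:.4f} ms per inference, {stats['fps']:.0f} fps")
```

Timing the whole loop and dividing by the number of runs, rather than timing each call separately, keeps the timer overhead out of the per-inference number.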
Straight to the performance point
Nobody likes waiting and, let’s be honest, most of you will mainly be interested in the results, so here we go:
The scoring with the quantized tflite model for CPU was different, but it always seemed to return the same prediction as the others. I guess that’s something weird in the model, and I’m pretty sure it doesn’t affect performance.
Now, because the results are so different for different platforms, it’s kind of hard to visualise. Here are a few graphs, choose your favourite…
Straight away, there are 3 bars in the first graph that jump into view. (Yes, the first graph, linear scale fps, is my favourite, because it shows the difference in the high-performance results.) Of these 3 bars, 2 were achieved by the Google Coral Edge TPU USB Accelerator and the 3rd one by a full-blown NVIDIA GTX1080 assisted by an Intel i7-7700K. Look a bit closer and you’ll see the GTX1080 actually got beaten by the Coral. Let that sink in for a few seconds and then prepare to be blown away... That GTX1080 draws a maximum of 180W, which is absolutely HUGE compared to the Coral’s 2.5W.
You managed to stand up again already? Ok, let’s go on:
The next thing we see is that the NVIDIA Jetson Nano isn’t scoring well at all. Although it has a CUDA-enabled GPU, it’s really not much faster than my old i7-4870HQ. But that’s the catch: "not much faster" is still faster than a 50W, quad-core, hyperthreading CPU. From a few years back, true, but still. The Jetson Nano could never have consumed more than a short-term average of 12.5W, because that’s what I’m powering it with. That’s a 75% power reduction, with a 10% performance increase.
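To make that last comparison a bit more concrete, here is the arithmetic as a small sketch. It only uses the figures quoted above, which are upper bounds on power draw rather than measurements, so treat the result as a rough indication. The variable names are mine.

```python
# Figures quoted in the text (power ceilings, not measured consumption)
i7_power_w = 50.0            # i7-4870HQ
nano_power_w = 12.5          # Jetson Nano, limited by its power supply
nano_relative_speed = 1.10   # "10% performance increase" over the i7

power_ratio = nano_power_w / i7_power_w
power_reduction = 1 - power_ratio
perf_per_watt_gain = nano_relative_speed / power_ratio

print(f"power reduction: {power_reduction:.0%}")
print(f"performance per watt: {perf_per_watt_gain:.1f}x")
```

Under these assumptions the Nano does roughly 4.4 times as much work per watt as the laptop CPU, even though the raw fps numbers look almost the same.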
Clearly, the Raspberry Pi on its own isn’t anything impressive. Not with the floating point model and still not really anything useful with the quantised model. But hey, I had the files ready anyway and it was capable of running the tests, so more is always better right? Still kind of interesting because it shows the difference between the ARM Cortex A53 in the Pi and the A57 in the Jetson Nano.
NVIDIA Jetson Nano
So the Jetson Nano isn’t pumping out impressive fps rates with the MobileNetV2 classifier. But as I already stated, that doesn’t mean it isn’t a great piece of useful engineering. It’s cheap, it doesn’t need a shitload of energy to run and maybe the most important property is that it runs TensorFlow GPU (or any other ML platform) like any other machine you’ve always been using before. As long as your script isn’t diving too deep into CPU architectures, you can run the exact same script you would on an i7+CUDA GPU, also for training! I do still feel like NVIDIA should preload L4T with TensorFlow, but I’ll try not to rage about this any longer. After all, they have a nice explanation on how to install it (don’t be fooled though, TensorFlow 1.12 is not supported, only 1.13.1).
Google Coral Edge TPU
Ok I have a big love for nicely engineered and high efficiency specific electronic devices, so I’m maybe not completely objective. But this thing… It’s a thing of absolute beauty!
The Edge TPU is what we call an “ASIC” (Application Specific Integrated Circuit), which means that it has a combination of small electronic parts such as FETs and capacitors burned directly onto the silicon layer, in such a way that it does exactly what it needs to do to speed up inference.
Inference, yes: the Edge TPU is not able to perform backpropagation. So training your model will still need to be done on a different (preferably CUDA-enabled) machine.
The logic behind this sounds more complex than it is though. (Actually creating the hardware, and making it work, is a whole different thing, and is very, very complex. But the logic functions are much simpler). Next image shows the basic principle around which the Edge TPU has been designed.
A net like MobileNetV2 consists mostly of convolutions followed by activation layers. A convolution is defined as:
This means nothing more happens than multiplying each pixel in a patch of the image with the corresponding element of the kernel, then adding these products up to get one pixel of a new ‘image’ (feature map), and repeating that as the kernel slides over the image. That is exactly what the main component of the Edge TPU was meant for: multiplying everything at the same time, then adding it all up at insane speeds. There is no "CPU" behind this, it just does that whenever you pump data into the buffers on the left. If you’re really interested in how this works, look up “Digital Circuit” and “FPGA”, and you’ll probably find enough information to keep you busy for the next few months. It's sometimes rather complex to start with, but really, really interesting!
This is exactly why the Coral is in such a different league when comparing performance/Watt numbers: it is a bunch of electronics designed to do exactly the bitwise operations needed, with basically no overhead at all.
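For those who prefer code over words, the multiply-and-add idea can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions: a single channel, no padding, stride 1, and no kernel flip (which is how deep learning frameworks usually implement "convolution"). The function name `conv2d_valid` is made up, and this is of course not how the Edge TPU does it internally: the hardware performs these multiplications in parallel instead of in nested loops.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; multiply element-wise and sum."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for u in range(kh):
                for v in range(kw):
                    # one multiply-accumulate per (pixel, kernel element) pair
                    acc += image[i + u][j + v] * kernel[u][v]
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[-4, -4], [-4, -4]]
```

Every output pixel is just a sum of products, and none of them depend on each other, which is why this workload maps so well onto dedicated parallel hardware.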
Why no 8-bit model for GPU?
A GPU is inherently designed as a fine-grained parallel float calculator. Using floats is exactly what it was created for and what it's good at. The Edge TPU has been designed to do 8-bit stuff, and CPUs have clever ways of being faster with 8-bit stuff than with full bit-width floats, because they have to deal with this in a lot of cases.
I could give you a lot of reasons why MobileNetV2 is a good model, but the main reason is that it’s one of the pre-compiled models that Google made available for the Edge TPU.
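In case "8-bit stuff" sounds abstract: a quantised model stores each float as a small integer plus a shared scale and zero point, so that real value = scale * (q - zero_point). Below is a minimal sketch of that idea. The helper names and the chosen scale and zero point are mine, picked so the numbers come out clean; a real converter derives them from the actual value ranges in the model.

```python
def quantize(values, scale, zero_point):
    """Map floats to unsigned 8-bit integers, clamping to [0, 255]."""
    return [max(0, min(255, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map 8-bit integers back to (approximate) floats."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
scale, zero_point = 1 / 128, 128  # illustrative values, not from a real model

q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)

print(q)         # [0, 64, 128, 192, 255]
print(restored)  # [-1.0, -0.5, 0.0, 0.5, 0.9921875]
```

Note how the last weight comes back as 0.9921875 instead of 1.0: that small rounding error is the price paid for doing all the heavy lifting with 8-bit integers.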
What else is available on the Edge TPU?
It used to be just MobileNet and Inception in their different versions, but as of the end of last week, Google pushed an update which allows us to compile custom TensorFlow Lite models. But the limit is, and will probably always be, TensorFlow Lite models. That is different with the Jetson Nano: that thing runs anything you can imagine.
Raspberry Pi + Coral vs the rest
Why does the Coral seem so much slower when connected to a Raspberry Pi? The answer is simple and straightforward: the Raspberry Pi has only USB 2.0 ports, while the rest have USB 3.0 ports. Since we can see the Coral is faster on the i7-7700K than on the Jetson Nano, but still doesn’t seem to score as well as the Coral Dev Board did when NVIDIA tested it, we can conclude the bottleneck is the data rate and not the Edge TPU.
Ok, I’m the last one left in the office by now, I think this has been long enough for me, and probably for you as well. I have been absolutely blown away by the power of the Google Coral Edge TPU, but to me, the most interesting setup here was the NVIDIA Jetson Nano in combination with the Coral USB Accelerator. I will most certainly use that setup. It definitely feels like a dream to work with. It has better performance than a Raspberry Pi (for triple the cost though). But most importantly, it has USB 3.0 ports, which give a huge performance boost to the Coral USB Accelerator.
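A quick back-of-the-envelope calculation shows why the USB version matters. The sketch below assumes one 224x224 RGB image of 8-bit values per inference (my assumption for the input size) and uses the raw signalling rates of USB 2.0 (480 Mbit/s) and USB 3.0 (5 Gbit/s). Real throughput is lower because of protocol overhead and latency, so these are best-case numbers, not measurements.

```python
INPUT_BYTES = 224 * 224 * 3   # one uint8 RGB image (assumed input size)
USB2_BITS_PER_S = 480e6       # USB 2.0 raw signalling rate
USB3_BITS_PER_S = 5e9         # USB 3.0 raw signalling rate


def transfer_ms(n_bytes, bits_per_s):
    """Best-case time in milliseconds to move n_bytes over the link."""
    return 1000 * n_bytes * 8 / bits_per_s


usb2_ms = transfer_ms(INPUT_BYTES, USB2_BITS_PER_S)
usb3_ms = transfer_ms(INPUT_BYTES, USB3_BITS_PER_S)

print(f"USB 2.0: {usb2_ms:.2f} ms per image")
print(f"USB 3.0: {usb3_ms:.2f} ms per image")
```

Even in this ideal case, USB 2.0 needs about 2.5 ms just to get one image across, against roughly 0.24 ms for USB 3.0. When the accelerator itself only needs a few milliseconds per inference, that transfer time becomes a big part of the total.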
I hope you had an interesting read. If there are any remarks or questions, do not hesitate to contact me. As usual, this is also where I tell you I will probably write something new soon. Keep your eyes open and all that. Cheers!