Edge TPU performance benchmarks

An individual Edge TPU can perform 4 trillion operations per second (4 tera-operations per second, or TOPS), using 0.5 watts for each TOPS (2 TOPS per watt). How that translates to performance for your application depends on several factors. Every neural network model has different demands, and if you're using the USB Accelerator, total performance also depends on the host CPU, USB speed, and other system resources.
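As a quick check on the arithmetic above, the efficiency and peak power implied by those two figures can be worked out directly. This is only a sketch using the nominal numbers quoted on this page; the variable names are illustrative, and real power draw depends on the workload.

```python
# Nominal figures quoted above for a single Edge TPU.
PEAK_TOPS = 4.0        # tera-operations per second
WATTS_PER_TOPS = 0.5   # power cost of each TOPS

# Efficiency is the reciprocal of the per-TOPS power cost.
tops_per_watt = 1.0 / WATTS_PER_TOPS

# Power drawn when running at the peak rate.
peak_power_watts = PEAK_TOPS * WATTS_PER_TOPS

print(f"{tops_per_watt:.1f} TOPS per watt")
print(f"{peak_power_watts:.1f} W at peak throughput")
```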

With that said, Table 1 compares the time required to perform a single inference with several popular models, on a CPU alone and with the Edge TPU. For the sake of comparison, the models running on both the CPU and the Edge TPU are all TensorFlow Lite versions.

These models are a small selection of the architectures that are compatible with the Edge TPU, and all of them were trained using the ImageNet dataset with 1,000 classes. If you want to test your own models, read the model architecture requirements.

Note: These figures measure only the time required to execute the model. They do not include the time to process input data (such as down-scaling images to fit the input tensor), which can vary between systems and applications. The tests were also run using C++ benchmarks, so our public Python benchmark scripts may be slower due to Python overhead.
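To illustrate what "time required to execute the model only" means in practice, here is a minimal timing sketch. It is not the benchmark code used for Table 1; `time_inference` and `fake_inference` are hypothetical names, and the stand-in function simply sleeps so that the example runs without any model or accelerator.

```python
import statistics
import time


def time_inference(run_inference, warmup=5, runs=50):
    """Return the mean and standard deviation of run_inference() latency in ms."""
    # Warm-up calls are discarded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        run_inference()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()  # only model execution sits inside the timed region
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return statistics.mean(samples_ms), statistics.stdev(samples_ms)


def fake_inference():
    """Hypothetical stand-in for a model invocation; sleeps for about 2 ms."""
    time.sleep(0.002)


# Input preprocessing (for example, resizing an image) would happen here,
# before the call below, so that it stays outside the timed region.
mean_ms, stdev_ms = time_inference(fake_inference, warmup=2, runs=10)
print(f"{mean_ms:.2f} ms per inference (stdev {stdev_ms:.2f} ms)")
```

With a TensorFlow Lite interpreter, the callable passed in would typically wrap `interpreter.invoke()`, with the input tensor already set before timing starts.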

**Table 1.** Time per inference, in milliseconds (ms)
| Model architecture | Desktop CPU ¹ | Desktop CPU ¹ + USB Accelerator (USB 3.0) with Edge TPU | Embedded CPU ² | Dev Board ³ with Edge TPU |
| --- | --- | --- | --- | --- |
| Unet Mv2 (128x128) | 27.7 | 3.3 | 190.7 | 5.7 |
| DeepLab V3 (513x513) | 394 | 52 | 1139 | 241 |
| DenseNet (224x224) | 380 | 20 | 1032 | 25 |
| Inception v1 (224x224) | 90 | 3.4 | 392 | 4.1 |
| Inception v4 (299x299) | 700 | 85 | 3157 | 102 |
| Inception-ResNet V2 (299x299) | 753 | 57 | 2852 | 69 |
| MobileNet v1 (224x224) | 53 | 2.4 | 164 | 2.4 |
| MobileNet v2 (224x224) | 51 | 2.6 | 122 | 2.6 |
| MobileNet v1 SSD (224x224) | 109 | 6.5 | 353 | 11 |
| MobileNet v2 SSD (224x224) | 106 | 7.2 | 282 | 14 |
| ResNet-50 V1 (299x299) | 484 | 49 | 1763 | 56 |
| ResNet-50 V2 (299x299) | 557 | 50 | 1875 | 59 |
| ResNet-152 V2 (299x299) | 1823 | 128 | 5499 | 151 |
| SqueezeNet (224x224) | 55 | 2.1 | 232 | 2 |
| VGG16 (224x224) | 867 | 296 | 4595 | 343 |
| VGG19 (224x224) | 1060 | 308 | 5538 | 357 |
| EfficientNet-EdgeTpu-S* | 5431 | 5.1 | 705 | 5.5 |
| EfficientNet-EdgeTpu-M* | 8469 | 8.7 | 1081 | 10.6 |
| EfficientNet-EdgeTpu-L* | 22258 | 25.3 | 2717 | 30.5 |

¹ Desktop CPU: Single 64-bit Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
² Embedded CPU: Quad-core Cortex-A53 @ 1.5GHz
³ Dev Board: Quad-core Cortex-A53 @ 1.5GHz + Edge TPU

* Latency on CPU is high for these models because the TensorFlow Lite runtime is not fully optimized for quantized models on all platforms.
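
One way to read Table 1 is as a speedup ratio: the CPU-only latency divided by the latency with the Edge TPU on the same class of host. The sketch below does this for three rows copied from the table; the dictionary layout and variable names are illustrative only.

```python
# A few rows copied from Table 1: time per inference in milliseconds.
# Columns: (desktop CPU, desktop CPU + USB Accelerator, embedded CPU, Dev Board)
latency_ms = {
    "MobileNet v1 (224x224)": (53, 2.4, 164, 2.4),
    "Inception v4 (299x299)": (700, 85, 3157, 102),
    "VGG16 (224x224)": (867, 296, 4595, 343),
}

speedups = {}
for model, (desktop, usb, embedded, dev_board) in latency_ms.items():
    usb_speedup = desktop / usb            # desktop CPU vs. USB Accelerator
    board_speedup = embedded / dev_board   # embedded CPU vs. Dev Board
    speedups[model] = (usb_speedup, board_speedup)
    print(f"{model}: {usb_speedup:.1f}x with USB Accelerator, "
          f"{board_speedup:.1f}x on Dev Board")
```

For these rows, MobileNet v1 runs roughly 22 times faster with the USB Accelerator than on the desktop CPU alone, while VGG16 gains only about 3 times, so the benefit varies considerably by architecture.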
