TensorFlow models on the Edge TPU

In order for the Edge TPU to provide high-speed neural network performance with a low-power cost, the Edge TPU supports a specific set of neural network operations and architectures. This page describes what types of models are compatible with the Edge TPU, how you can create them, and a bit about how to run them.

Compatibility overview

The Edge TPU is capable of executing deep feed-forward neural networks such as convolutional neural networks (CNN). It supports only TensorFlow Lite models that are fully 8-bit quantized and then compiled specifically for the Edge TPU.

If you're not familiar with TensorFlow Lite, it's a lightweight version of TensorFlow designed for mobile and embedded devices. It achieves low-latency inference in a small binary size—both the TensorFlow Lite models and interpreter kernels are much smaller. You cannot train a model directly with TensorFlow Lite; instead you must convert your model from a TensorFlow file (such as a .pb file) to a TensorFlow Lite file (a .tflite file), using the TensorFlow Lite converter.

TensorFlow supports a model optimization technique called quantization, which is required by the Edge TPU. Quantizing your model means converting all the 32-bit floating-point numbers (such as weights and activation outputs) to the nearest 8-bit fixed-point numbers. This makes the model smaller and faster. And although these 8-bit representations can be less precise, the inference accuracy of the neural network is not significantly affected.

Note: The Edge TPU does not support models built using post-training quantization, because although it creates a smaller model by using 8-bit values, it converts them back to 32-bit floats during inference. So you must use quantization-aware training, which uses "fake" quantization nodes to simulate the effect of 8-bit values during training, thus allowing inferences to run using the quantized values. This technique makes the model more tolerant of the lower precision values, which generally results in a higher accuracy model (compared to post-training quantization).

Figure 1 illustrates the basic process to create a model that's compatible with the Edge TPU. Most of the workflow uses standard TensorFlow tools. Once you have a TensorFlow Lite model, you then use our Edge TPU compiler to create a .tflite file that's compatible with the Edge TPU.

Figure 1. The basic workflow to create a model for the Edge TPU

However, you don't need to follow this whole process to create a good model for the Edge TPU. Instead, you can leverage existing TensorFlow models that are compatible with the Edge TPU by retraining them with your own dataset. For example, MobileNet is a popular image classification/detection model architecture that's compatible with the Edge TPU. We've created several versions of this model that you can use as a starting point to create your own model that recognizes different objects. To get started, see the section below about how to retrain an existing model with transfer learning.

But if you have designed—or plan to design—your own model from scratch, then you should read the next section about model requirements.

Model requirements

If you want to build your own TensorFlow model that takes full advantage of the Edge TPU at runtime, it must meet the following requirements:

  • Tensor parameters are quantized (8-bit fixed-point numbers). You must use quantization-aware training (post-training quantization is not supported).
  • Tensor sizes are constant at compile-time (no dynamic sizes).
  • Model parameters (such as bias tensors) are constant at compile-time.
  • Tensors are either 1-, 2-, or 3-dimensional. If a tensor has more than 3 dimensions, then only the 3 innermost dimensions may have a size greater than 1.
  • The model uses only the operations supported by the Edge TPU (see table 1 below).

If your model does not meet these requirements entirely, it can still compile, but only a portion of the model will execute on the Edge TPU. At the first point in the model graph where an unsupported operation occurs, the compiler partitions the graph into two parts. The first part of the graph that contains only supported operations is compiled into a custom operation that executes on the Edge TPU, and everything else executes on the CPU, as illustrated in figure 2.

Note: Currently, the Edge TPU compiler cannot partition the model more than once, so as soon as an unsupported operation occurs, that operation and everything after it executes on the CPU, even if supported operations occur later.
Figure 2. The compiler creates a single custom op for all Edge TPU compatible ops; anything else stays the same

If you inspect your compiled model (with a tool such as visualize.py), you'll see that it's still a TensorFlow Lite model except it now has a custom operation at the beginning of the graph. This custom operation is the only part of your model that is actually compiled—it contains all the operations that run on the Edge TPU. The rest of the graph (beginning with the first unsupported operation) remains the same.

If part of your model executes on the CPU, you should expect a significantly degraded inference speed compared to a model that executes entirely on the Edge TPU. We cannot predict how much slower your model will perform in this situation, so you should experiment with different architectures and strive to create a model that is 100% compatible with the Edge TPU. That is, your compiled model should contain only the Edge TPU custom operation.

Note: When compilation completes, the Edge TPU compiler tells you how many operations can execute on the Edge TPU and how many must instead execute on the CPU (if any at all). But beware that the percentage of operations that execute on the Edge TPU versus the CPU does not correspond to the overall performance impact—if even a small fraction of your model executes on the CPU, it can potentially slow the inference speed by an order of magnitude (compared to a version of the model that runs entirely on the Edge TPU).
Table 1. All operations supported by the Edge TPU and any known limitations
Operation name Known limitations
AveragePool2d No fused activation function.
Concatenation No fused activation function.
If any input is a compile-time constant tensor, there must be only 2 inputs, and this constant tensor must be all zeroes (effectively, a zero-padding op).
FullyConnected Only default format supported for fully-connected weights. Output tensor is one-dimensional.
MaxPool2d No fused activation function.
Mean Supports reduction along x- and/or y-dimensions only.
Pad Supports padding along x- and/or y-dimensions only.
ResizeBilinear Input/output is a 3-dimensional tensor. Depending on input/output size, this operation may not be mapped to the Edge TPU to avoid loss in precision.
ResizeNearestNeighbor Input/output is a 3-dimensional tensor. Depending on input/output size, this operation may not be mapped to the Edge TPU to avoid loss in precision.
Softmax Supports only 1D input tensor with a max of 16,000 elements.
Squeeze Supported only when input tensor dimensions that have leading 1s (that is, no relayout needed). For example input tensor with [y][x][z] = 1,1,10 or 1,5,10 is ok. But [y][x][z] = 5,1,10 is not supported.
StridedSlice Supported only when all strides are equal to 1 (that is, effectively a Stride op), and with ellipsis-axis-mask == 0, and new-axis-max == 0.

Transfer learning

Instead of building your own model to conform to the above requirements and then train it from scratch, you can retrain an existing model that's already compatible with the Edge TPU, using a technique called transfer learning (sometimes also called "fine tuning").

Training a neural network from scratch (when it has no computed weights or bias) can take days-worth of computing time and requires a vast amount of training data. But transfer learning allows you to start with a model that's already trained for a related task and then perform further training to teach the model new classifications using a smaller training dataset. You can do this by retraining the whole model (adjusting the weights across the whole network), but you can also achieve very accurate results by simply removing the final layer that performs classification, and training a new layer on top that recognize your new classes.

Using this process, with sufficient training data and some adjustments to the hyperparameters, you can create a highly accurate TensorFlow model in a single sitting. Once you're happy with the model's performance, simply convert it to TensorFlow Lite and then compile it for the Edge TPU. And because the model architecture doesn't change during transfer learning, you know it will fully compile for the Edge TPU (assuming you start with a compatible model).

If you're already familiar with transfer learning, check out our Edge TPU-compatible models that you can use as a starting point to create your own model. Just click to download "All model files" to get the TensorFlow model and pre-trained checkpoints you need to begin transfer learning.

If you're new to this technique and want to quickly see some results, try the following tutorials that simplify the process to retrain a MobileNet model with new classes:

You can even retrain a model directly on your device with the Edge TPU, using a slightly different technique called weight imprinting through our Python API:

Edge TPU runtime and APIs

To execute your model on the Edge TPU, you need the Edge TPU runtime and API library installed on the host system.

If you're using the Coral Dev Board or SoM, the Edge TPU runtime and API library is already provided in the Mendel operating system. If you're using an accessory device such as the Coral USB Accelerator, you must install both onto the host computer—only Debian-based operating systems are currently supported (see the setup instructions).

From your host system, you can perform an inference using one of the following APIs provided with the Edge TPU library:

  • Edge TPU C++ API: This is just a small extension (edgetpu.h) for the TensorFlow Lite C++ API, so you'll mostly be using the latter to execute an inference with your model.
  • Edge TPU Python API: This is a wrapper for the C++ API that adds several convenience APIs not included in C++, such as ClassificationEngine, which allows you to perform image classification by simply providing a compiled .tflite model and the image you want to classify.