The Edge TPU Compiler (edgetpu_compiler) is a command line tool that compiles a TensorFlow Lite model (a .tflite file) into a file that's compatible with the Edge TPU. This page describes how to use the compiler and a bit about how it works.
Before using the compiler, be sure you have a model that's compatible with the Edge TPU. For compatibility details, read TensorFlow models on the Edge TPU.
The Edge TPU Compiler can be run on any modern Debian-based Linux system. Specifically, it requires the following:
- Debian 6.0 or higher, or any derivative thereof (such as Ubuntu 10.0+)
- System architecture of x86-64 or ARM64 with ARMv8 instruction set
This includes support for Mendel on the Coral Dev Board.
If you don't have a compatible system, you can instead use the online Edge TPU compiler.
You can install the compiler on your Linux system with the following commands:
echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" | sudo tee /etc/apt/sources.list.d/coral-edgetpu.list

sudo apt update

sudo apt install edgetpu-compiler
edgetpu_compiler [options] model...
The compiler accepts the file path to one or more TensorFlow Lite models (the .tflite files), plus any options. If you pass multiple models (each separated with a space), they are co-compiled such that they can share the Edge TPU's RAM for parameter data caching (read below about parameter data caching).
The filename for each compiled model is the input filename with "_edgetpu" appended (for example, model_edgetpu.tflite), and it is saved to the current directory, unless you specify otherwise with the out_dir option.
| Option | Description |
|---|---|
| -o, --out_dir dir | Output the compiled model and log files to directory dir. Default is the current directory. |
| -m, --min_runtime_version ver | The minimum Edge TPU runtime version required by your model. Models are forward-compatible with new Edge TPU runtimes. For example, if the device where you plan to execute your model has version 10 of the Edge TPU runtime, then you should set this to 10. The default depends on your version of the compiler; check the --help output. |
| -s, --show_operations | Print the log showing operations that mapped to the Edge TPU. The same information is always written in a .log file saved alongside the compiled model. |
| -v, --version | Print the compiler version and exit. |
| -h, --help | Print the command line help and exit. |
Parameter data caching
The Edge TPU has roughly 8 MB of SRAM that can cache the model's parameter data. However, a small amount of the RAM is first reserved for the model's inference executable, so the parameter data uses whatever space remains after that. Naturally, keeping the parameter data in the Edge TPU RAM enables faster inference than fetching it from external memory.
This Edge TPU "cache" is not actually traditional cache—it's compiler-allocated scratchpad memory. The Edge TPU Compiler adds a small executable inside the model that writes a specific amount of the model's parameter data to the Edge TPU RAM (if available) before running an inference.
When you compile models individually, the compiler gives each model a unique "caching token" (a 64-bit number). Then when you execute a model, the Edge TPU runtime compares that caching token to the token of the data that's currently cached. If the tokens match, the runtime uses that cached data. If they don't match, it wipes the cache and writes the new model's data instead. (When models are compiled individually, only one model at a time can cache its data.) This process is illustrated in figure 1.
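The token comparison described above can be sketched in Python. This is purely an illustrative simulation; the class and names below are hypothetical and not part of any Edge TPU API:

```python
class CacheTokenSim:
    """Illustrative simulation of the Edge TPU runtime's caching-token check.

    Each individually compiled model carries a unique 64-bit caching token;
    the runtime rewrites the cache only when the incoming token differs
    from the token of the data currently cached.
    """

    def __init__(self):
        self.cached_token = None  # no model data cached yet

    def run_inference(self, model_token):
        # Returns what happened to the cache for this inference.
        if self.cached_token == model_token:
            return "cache hit"        # reuse the cached parameter data
        self.cached_token = model_token
        return "cache rewritten"      # wipe the cache, write new data


sim = CacheTokenSim()
print(sim.run_inference(0xA1))  # cache rewritten (first run is slower)
print(sim.run_inference(0xA1))  # cache hit
print(sim.run_inference(0xB2))  # cache rewritten (switching models)
```

Note how alternating between two individually compiled models would hit the "cache rewritten" path on every switch, which is exactly the overhead co-compilation avoids.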
Notice that the system clears the cache and writes model data to cache only when necessary, and doing so delays the inference. So the first time your model runs is always slower. Any later inferences are faster because they use the cache that's already written. But if your application constantly switches between multiple models, this cache swapping adds significant overhead to your application's overall performance. That's where the co-compilation feature comes in...
Co-compiling multiple models
To speed up performance when you continuously run multiple models on the same Edge TPU, the compiler supports co-compilation. Essentially, co-compiling your models allows multiple models to share the Edge TPU RAM to cache their parameter data together, eliminating the need to clear the cache each time you run a different model.
When you pass multiple models to the compiler, each compiled model is assigned the same caching token. So now when you run your second model for the first time, it can write its data to the cache without clearing it first. Look again at figure 1—this is when the second decision node ("Does the model have cache to write?") becomes "Yes."
But beware that the amount of RAM allocated to each model is fixed at compile-time, and it's prioritized based on the order the models appear in the compiler command. For example, consider if you co-compile two models as shown here:
edgetpu_compiler model_A.tflite model_B.tflite
In this case, cache space is first allocated to model A's data (as much as can fit). If space remains after that, cache is given to model B's data. If some of the model data cannot fit into the Edge TPU RAM, then it must instead be fetched from the external memory at run time.
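This first-come allocation can be sketched as a greedy pass over the models in command-line order. The function below is illustrative, not the compiler's actual algorithm; in particular, the real compiler allocates data one layer at a time, so a model may use slightly less on-chip space than remains available:

```python
def allocate_cache(model_sizes_mib, ram_mib):
    """Greedily give each model as much of the remaining on-chip RAM as it
    can use, in the order given; the rest of its parameter data must be
    streamed from external memory at run time."""
    remaining = ram_mib
    result = []
    for size in model_sizes_mib:
        on_chip = min(size, remaining)
        result.append({"on_chip": on_chip, "off_chip": size - on_chip})
        remaining -= on_chip
    return result

# Model A is listed first, so its 4 MiB is cached fully; model B gets
# only what remains of the 8 MiB and must stream the other 3 MiB.
print(allocate_cache([4, 7], 8))
# [{'on_chip': 4, 'off_chip': 0}, {'on_chip': 4, 'off_chip': 3}]
```

Reversing the list would instead cache model B fully and leave model A with the remainder, which is why the order of models on the command line matters.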
If you co-compile several models, it's possible some models don't get any cache, so they must load all data from external memory. Yes, that's slower than using the cache, but if you're running the models in quick succession, this could still be faster than swapping the cache every time you run a different model.
It's important to remember that the cache allocated to each model is not traditional cache, but compiler-allocated scratchpad memory.
The Edge TPU Compiler knows the size of the Edge TPU RAM, and it knows how much memory is needed by
each model's executable and parameter data. So the compiler assigns a fixed amount of cache space
for each model's parameter data at compile-time. The
edgetpu_compiler command prints this
information for each model given. For example, here's a snippet of the compiler output for one model:
On-chip memory available for caching model parameters: 6.91MiB
On-chip memory used for caching model parameters: 4.21MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
In this case, the model's parameter data all fits into the Edge TPU RAM: the amount shown for "Off-chip memory used" is zero.
However, if you co-compile two models, then this first model uses 4.21 MiB of the available 6.91 MiB of RAM, leaving only 2.7 MiB for the second model. If that's not enough space for all parameter data, then the rest must be fetched from the external memory. In this case, the compiler prints information for the second model such as this:
On-chip memory available for caching model parameters: 2.7MiB
On-chip memory used for caching model parameters: 2.49MiB
Off-chip memory used for streaming uncached model parameters: 4.25MiB
Notice the amount of "Off-chip memory used" for this second model is 4.25 MiB. This scenario is roughly illustrated in figure 2.
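As a quick sanity check on these figures (values copied from the compiler output above): the space left for model B is the total minus model A's usage, and model B's full parameter size is its on-chip plus off-chip portions.

```python
total_ram = 6.91         # MiB available for caching (first model's log)
model_a_on_chip = 4.21   # MiB cached for model A
model_b_on_chip = 2.49   # MiB cached for model B
model_b_off_chip = 4.25  # MiB model B streams from external memory

print(round(total_ram - model_a_on_chip, 2))         # 2.7 MiB left for model B
print(round(model_b_on_chip + model_b_off_chip, 2))  # 6.74 MiB of parameters in model B
```

Model B caches only 2.49 MiB of the 2.7 MiB available, likely because data is allocated in whole-layer units and the next layer did not fit.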
Even if your application then runs only this second model ("model B"), it will always store only a portion of its data on the Edge TPU RAM, because that's the amount determined to be available when you co-compiled it with another model ("model A").
The main benefit of this static design is that your performance is deterministic when your models are co-compiled, and time is not spent frequently rewriting the RAM. And, of course, if your models do fit all parameter data into the Edge TPU RAM, then you achieve maximum performance gains by never reading from external memory and never rewriting the Edge TPU RAM.
When deciding whether to use co-compilation, you should run the compiler with all your models to see whether they can fit all parameter data into the Edge TPU RAM (read the compiler output). If they can't all fit, consider how frequently each model is used. Perhaps pass the most-often-used model to the compiler first, so it can cache all its parameter data. If they can't all fit and they switch rarely, then perhaps co-compilation is not beneficial because the time spent reading from external memory is more costly than periodically rewriting the Edge TPU RAM. To decide what works best, you might need to test different compilation options.
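That trade-off can be framed as a rough back-of-the-envelope comparison. The function and all numbers below are hypothetical placeholders; you would measure the real costs on your own models and hardware:

```python
def co_compilation_wins(switches_per_s, cache_rewrite_ms,
                        inferences_per_s, extra_streaming_ms):
    """Co-compilation saves a cache rewrite on every model switch but may
    add external-memory streaming time to every inference. It pays off
    when the rewrite time saved per second exceeds the streaming time
    added per second."""
    saved = switches_per_s * cache_rewrite_ms
    added = inferences_per_s * extra_streaming_ms
    return saved > added

# Switching models constantly: avoiding cache rewrites dominates.
print(co_compilation_wins(20, 30.0, 20, 5.0))   # True
# Switching rarely: the per-inference streaming penalty dominates.
print(co_compilation_wins(0.1, 30.0, 20, 5.0))  # False
```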
You might also try passing your models to edgetpu_compiler in a different order. The cache is allocated to each model's data one layer at a time, so you might find an order that fits more total data in the cache because it allows in more of the smaller layers.
If the log for your compiled model shows lots of operations that are "Mapped to the CPU," then carefully review the model requirements and try making changes to increase the number of operations that are mapped to the Edge TPU.
If the compiler completely fails to compile your model and it prints a generic error message, please contact us to report the issue. When you do, please share the TensorFlow Lite model that you're trying to compile so we can debug the issue (we don’t need the fully trained model—one with randomly initialized parameters should be fine).