Edge TPU Compiler

The Edge TPU Compiler (edgetpu_compiler) is a command line tool that compiles a TensorFlow Lite model (.tflite file) into a file that's compatible with the Edge TPU. This page describes how to use the compiler and a bit about how it works.

Before using the compiler, be sure you have a model that's compatible with the Edge TPU. For compatibility details, read TensorFlow models on the Edge TPU.

System requirements

The Edge TPU Compiler runs on any modern Debian-based Linux system. Specifically, it requires the following:

  • Debian 6.0 or higher, or any derivative thereof (such as Ubuntu 10.0+)
  • System architecture of x86-64 or ARM64 with ARMv8 instruction set

This includes support for Mendel on the Coral Dev Board.
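To confirm the architecture and OS release before you install, one option is a quick check with the Python standard library (a minimal sketch; expect 'x86_64' for x86-64 systems and 'aarch64' for ARM64 systems):

import platform
from pathlib import Path

# Confirm the CPU architecture reported by the kernel.
print(platform.machine())                   # expect 'x86_64' or 'aarch64'

# Show the distribution details (Debian, Ubuntu, Mendel, or another derivative).
print(Path("/etc/os-release").read_text())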

Download

You can install the compiler on your Linux system with the following commands:

curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" | sudo tee /etc/apt/sources.list.d/coral-edgetpu.list

sudo apt-get update

sudo apt-get install edgetpu-compiler

Usage

edgetpu_compiler [options] model...

The compiler accepts the file path to one or more TensorFlow Lite models (the model argument), plus any options. If you pass multiple models (each separated with a space), they are co-compiled such that they can share the Edge TPU's RAM for parameter data caching (read below about parameter data caching).

The filename for each compiled model is input_filename_edgetpu.tflite, and it is saved to the current directory, unless you specify otherwise with the --out_dir option.
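If you drive the compiler from a build script, a minimal sketch along these lines works (it assumes edgetpu_compiler is on your PATH and uses a hypothetical input file named mobilenet_v2.tflite):

import subprocess
from pathlib import Path

model = Path("mobilenet_v2.tflite")   # hypothetical input model
out_dir = Path("compiled")
out_dir.mkdir(exist_ok=True)

# Compile for the Edge TPU; the compiled model and its .log are written to out_dir.
subprocess.run(["edgetpu_compiler", "--out_dir", str(out_dir), str(model)],
               check=True)

print(out_dir / f"{model.stem}_edgetpu.tflite")   # compiled model
print(out_dir / f"{model.stem}_edgetpu.log")      # compilation log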

Table 1. Available compiler options

-o, --out_dir dir
    Output the compiled model and log files to directory dir. Default is the current directory.

-m, --min_runtime_version val
    The lowest Edge TPU runtime version you want the model to be compatible with. For example, if the device where you plan to execute your model has version 10 of the Edge TPU runtime (and you can't update the runtime version), then you should set this to 10 to ensure your model will be compatible. (Models are always forward-compatible with newer Edge TPU runtimes; a model compiled for the version 10 runtime is compatible with version 12.) The default value depends on your version of the compiler; check the --help output. See below for more detail about the compiler and runtime versions.

-s, --show_operations
    Print the log showing operations that mapped to the Edge TPU. The same information is always written in a .log file with the same name and location as the compiled model.

-v, --version
    Print the compiler version and exit.

-h, --help
    Print the command line help and exit.

Parameter data caching

The Edge TPU has roughly 8 MB of SRAM that can cache the model's parameter data. However, a small amount of the RAM is first reserved for the model's inference executable, so the parameter data uses whatever space remains after that. Naturally, keeping the parameter data in the Edge TPU RAM enables faster inference than fetching it from external memory.

This Edge TPU "cache" is not actually traditional cache—it's compiler-allocated scratchpad memory. The Edge TPU Compiler adds a small executable inside the model that writes a specific amount of the model's parameter data to the Edge TPU RAM (if available) before running an inference.

When you compile models individually, the compiler gives each model a unique "caching token" (a 64-bit number). Then when you execute a model, the Edge TPU runtime compares that caching token to the token of the data that's currently cached. If the tokens match, the runtime uses that cached data. If they don't match, it wipes the cache and writes the new model's data instead. (When models are compiled individually, only one model at a time can cache its data.) This process is illustrated in figure 1.

Figure 1. Flowchart showing how the Edge TPU runtime manages model cache in the Edge TPU RAM

Notice that the system clears the cache and writes model data to the cache only when necessary, and doing so delays the inference. So a model's first inference is always slower; later inferences are faster because they use the cache that's already written. But if your application constantly switches between multiple models, this cache swapping adds significant overhead to your application's overall performance. That's where the co-compilation feature comes in...
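Before looking at co-compilation, you can make this swap overhead visible on your own hardware with a rough timing sketch like the one below. It assumes the tflite_runtime package and libedgetpu are installed, and it uses two hypothetical, individually compiled models named model_a_edgetpu.tflite and model_b_edgetpu.tflite:

import time
import numpy as np
import tflite_runtime.interpreter as tflite

def make_interpreter(path):
    # Load a compiled model with the Edge TPU delegate.
    return tflite.Interpreter(
        model_path=path,
        experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])

interpreters = {}
for name in ("model_a_edgetpu.tflite", "model_b_edgetpu.tflite"):
    interpreter = make_interpreter(name)
    interpreter.allocate_tensors()
    interpreters[name] = interpreter

# Alternate between the two models; with individually compiled models,
# each switch forces the runtime to rewrite the Edge TPU RAM.
for _ in range(3):
    for name, interpreter in interpreters.items():
        inp = interpreter.get_input_details()[0]
        interpreter.set_tensor(inp["index"],
                               np.zeros(inp["shape"], dtype=inp["dtype"]))
        start = time.perf_counter()
        interpreter.invoke()
        print(f"{name}: {(time.perf_counter() - start) * 1000:.1f} ms")

Because the two models carry different caching tokens, each switch in the inner loop should show the cost of rewriting the Edge TPU RAM; with co-compiled models (described next), the timings should settle down after the first pass.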

Co-compiling multiple models

To speed up performance when you continuously run multiple models on the same Edge TPU, the compiler supports co-compilation. Essentially, co-compilation allows multiple models to share the Edge TPU RAM to cache their parameter data together, eliminating the need to clear the cache each time you run a different model.

When you pass multiple models to the compiler, each compiled model is assigned the same caching token. So now when you run your second model for the first time, it can write its data to the cache without clearing it first. Look again at figure 1—this is when the second decision node ("Does the model have cache to write?") becomes "Yes."

But beware that the amount of RAM allocated to each model is fixed at compile-time, and it's prioritized based on the order the models appear in the compiler command. For example, consider if you co-compile two models as shown here:

edgetpu_compiler model_A.tflite model_B.tflite

In this case, cache space is first allocated to model A's data (as much as can fit). If space remains after that, cache is given to model B's data. If some of the model data cannot fit into the Edge TPU RAM, then it must instead be fetched from the external memory at run time.

If you co-compile several models, it's possible some models don't get any cache, so they must load all data from external memory. Yes, that's slower than using the cache, but if you're running the models in quick succession, this could still be faster than swapping the cache every time you run a different model.

Note: Parameter data is allocated to cache one layer at a time—either all parameter data from a given layer fits into cache and is written there, or that layer's data is too big to fit and all data for that layer must be fetched from external memory.
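The layer-at-a-time, order-of-priority behavior described above can be pictured with a simple greedy sketch (illustrative only; the real compiler also accounts for the executable region, alignment, and other details omitted here, and the sizes below are entirely hypothetical):

# Illustrative sketch of layer-by-layer cache allocation in command-line order.
def allocate_cache(models, available_bytes):
    """models: list of (model_name, [per_layer_parameter_sizes]) in compile order."""
    placement = {}
    for name, layer_sizes in models:
        for i, size in enumerate(layer_sizes):
            if size <= available_bytes:
                placement[(name, i)] = "on-chip"
                available_bytes -= size
            else:
                # A layer is never split: if it doesn't fit, all of its
                # parameter data streams from external memory at run time.
                placement[(name, i)] = "off-chip"
    return placement

ram = int(6.9 * 1024 * 1024)  # roughly the space left for parameter data
models = [("model_A", [3_000_000, 2_500_000]),
          ("model_B", [2_000_000, 1_500_000])]
print(allocate_cache(models, ram))

In this made-up example, model_B's first layer doesn't fit, but its smaller second layer still does, which is also why reordering models (see the tip further below) can change how much data ends up cached.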

Performance considerations

It's important to remember that the cache allocated to each model is not traditional cache, but compiler-allocated scratchpad memory.

The Edge TPU Compiler knows the size of the Edge TPU RAM, and it knows how much memory is needed by each model's executable and parameter data. So the compiler assigns a fixed amount of cache space for each model's parameter data at compile-time. The edgetpu_compiler command prints this information for each model given. For example, here's a snippet of the compiler output for one model:

On-chip memory available for caching model parameters: 6.91MiB
On-chip memory used for caching model parameters: 4.21MiB
Off-chip memory used for streaming uncached model parameters: 0.00B

In this case, the model's parameter data all fits into the Edge TPU RAM: the amount shown for "Off-chip memory used" is zero.

However, if you co-compile two models, then this first model uses 4.21 MiB of the available 6.91 MiB of RAM, leaving only 2.7 MiB for the second model. If that's not enough space for all parameter data, then the rest must be fetched from the external memory. In this case, the compiler prints information for the second model such as this:

On-chip memory available for caching model parameters: 2.7MiB
On-chip memory used for caching model parameters: 2.49MiB
Off-chip memory used for streaming uncached model parameters: 4.25MiB

Notice the amount of "Off-chip memory used" for this second model is 4.25 MiB. This scenario is roughly illustrated in figure 2.

Note: The "On-chip memory available" that appears for the first co-compiled model is what's left after setting aside memory required by the model executables. If you co-compile multiple models, the space set aside for executables is shared between all models (unlike space for parameter data). That is, the amount given for the executables is only the amount of space required by the largest executable (not the sum of all executables).
Figure 2. Two co-compiled models that cannot both fit all parameter data on the Edge TPU RAM

Even if your application then runs only this second model ("model B"), it will always store only a portion of its data on the Edge TPU RAM, because that's the amount determined to be available when you co-compiled it with another model ("model A").

The main benefit of this static design is that your performance is deterministic when your models are co-compiled, and time is not spent frequently rewriting the RAM. And, of course, if your models do fit all parameter data into the Edge TPU RAM, then you achieve maximum performance gains by never reading from external memory and never rewriting the Edge TPU RAM.

When deciding whether to use co-compilation, run the compiler with all your models to see whether all their parameter data fits into the Edge TPU RAM (read the compiler output). If it doesn't all fit, consider how frequently each model is used: perhaps pass the most-often-used model to the compiler first, so it can cache all its parameter data. If the data doesn't all fit and you rarely switch between models, co-compilation might not be beneficial, because the time spent reading from external memory costs more than periodically rewriting the Edge TPU RAM. To decide what works best, you might need to test different compilation options.

Tip: If the data for multiple models doesn't all fit in the cache, try passing the models to edgetpu_compiler in a different order. As mentioned above, the data is allocated one layer at a time. Thus, you might find an order that fits more total data in the cache because it allows in more of the smaller layers.
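One way to experiment is to co-compile in each order and compare the off-chip totals the compiler reports. Here is a rough sketch (assuming edgetpu_compiler is on your PATH and two hypothetical model files, model_A.tflite and model_B.tflite):

import itertools
import re
import subprocess

models = ["model_A.tflite", "model_B.tflite"]   # hypothetical model files
scale = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2}

for order in itertools.permutations(models):
    # Co-compile in this order; per-model statistics are printed to stdout.
    result = subprocess.run(["edgetpu_compiler", *order],
                            capture_output=True, text=True, check=True)
    # Sum the "Off-chip memory used" lines to compare orderings.
    streamed = re.findall(r"Off-chip memory used.*?([\d.]+)\s*(B|KiB|MiB)",
                          result.stdout)
    total = sum(float(value) * scale[unit] for value, unit in streamed)
    print(" -> ".join(order), f": {total / 2**20:.2f} MiB streamed off-chip")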
Caution: You must be careful if you use co-compilation in combination with multiple Edge TPUs—if you co-compile your models but they actually run on separate Edge TPUs, your models might needlessly store parameter data on the external memory. So you should be sure that any co-compiled models actually run on the same Edge TPU.
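For example, with the TensorFlow Lite Python API (tflite_runtime), you can keep co-compiled models together by passing the same device option to the Edge TPU delegate when you create each interpreter. This is a sketch; the device string (here "usb:0") and the model filenames are assumptions you'd adjust for your setup:

import tflite_runtime.interpreter as tflite

DEVICE = "usb:0"   # assumed device string; e.g. "pci:0" for a PCIe-attached Edge TPU

def make_interpreter(path):
    # Bind the interpreter to a specific Edge TPU so co-compiled models share it.
    delegate = tflite.load_delegate("libedgetpu.so.1", {"device": DEVICE})
    return tflite.Interpreter(model_path=path, experimental_delegates=[delegate])

interpreter_a = make_interpreter("model_A_edgetpu.tflite")   # hypothetical filenames
interpreter_b = make_interpreter("model_B_edgetpu.tflite")
interpreter_a.allocate_tensors()
interpreter_b.allocate_tensors()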

Compiler and runtime versions

A model compiled for the Edge TPU must be executed using a corresponding version of the Edge TPU runtime. If you try to run a recently compiled model on an older runtime, then you'll see an error such as this:

Failed precondition: Package requires runtime version (12), which is newer than this runtime version (10).

To solve this, update the Edge TPU runtime on your device (see the setup guide for the Dev Board or for the USB Accelerator).

If you're unable to update the device runtime, you can instead re-compile your model to make it compatible with the older runtime version by including the --min_runtime_version flag when you run edgetpu_compiler. For example:

edgetpu_compiler --min_runtime_version 10 your_model.tflite

The following table shows the Edge TPU Compiler versions and the corresponding Edge TPU runtime version that's required by default. You can always use a newer compiler to create models compatible with older runtimes as described above.

Compiler version    Runtime version required (by default)
2.0                 12
1.0                 10

You can check your compiler version like this:

edgetpu_compiler --version

You can check the runtime version on your device like this:

python3 -c "import edgetpu.basic.edgetpu_utils; print(edgetpu.basic.edgetpu_utils.GetRuntimeVersion())"

Help

If the log for your compiled model shows lots of operations that are "Mapped to the CPU," then carefully review the model requirements and try making changes to increase the number of operations that are mapped to the Edge TPU.

If the compiler completely fails to compile your model and it prints a generic error message, please contact us to report the issue. When you do, please share the TensorFlow Lite model that you're trying to compile so we can debug the issue (we don’t need the fully trained model—one with randomly initialized parameters should be fine).