[Guangdong Loongson 2K500 Pioneer Board Hands-On] Running the Edge AI Framework TFLM
1. Introduction to TFLM
TFLM is short for TensorFlow Lite for Microcontrollers. It is an edge AI framework from Google that can run even on microcontrollers.
From the official introduction:
TensorFlow Lite for Microcontrollers is an experimental port of TensorFlow Lite designed for microcontrollers and other devices with only kilobytes of memory. It can run directly on "bare metal", with no need for operating system support, any standard C/C++ libraries, or dynamic memory allocation. The core runtime fits in 16 KB on a Cortex-M3, and with enough operators to run a speech keyword detection model it takes up only 22 KB in total.
TFLM project homepage: https://tensorflow.google.cn/lite/microcontrollers/overview?hl=zh-cn
TFLM code repository: https://github.com/tensorflow/tflite-micro
2. TFLM Usage Guide
Next, we will build TFLM on a PC and run its benchmarks.
First, download the TFLM source code with:
git clone https://github.com/tensorflow/tflite-micro.git
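If you do not need the full commit history, a shallow clone fetches far less data (optional; the plain clone works just as well):

```shell
# Shallow clone: fetch only the latest revision instead of the full history.
git clone --depth 1 https://github.com/tensorflow/tflite-micro.git
cd tflite-micro
```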
TFLM is an edge AI inference framework; you can think of it simply as a compute library. The project also ships benchmarks for exercising the framework: the same AI model can be run on different devices and their inference performance compared.
2.1 Benchmark overview
The top-level README.md of the TFLM repository links to the benchmark documentation:
https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/benchmarks/README.md
The document is short. From it we learn that TFLM provides two benchmarks (in fact there are three):
- Keyword benchmark
  - The keyword benchmark feeds random data generated at runtime into the model, so its output is meaningless
- Person detection benchmark
  - The person detection benchmark uses two BMP images as input
  - They live in the tensorflow/lite/micro/examples/person_detection/testdata subdirectory
2.2 Installing dependencies
The TFLM build downloads some test data and uses the Pillow library to convert some of the test images into C code, so the Pillow library and a few command-line tools must be installed before building TFLM.
Before running the TFLM benchmarks, install the required packages with:
sudo apt install python3 python3-pip git unzip wget build-essential
Pillow is a Python library, so if your Linux PC does not yet have Python, install it first.
2.2.1 Configuring a pip mirror
Pointing pip at a domestic (Chinese) mirror speeds up package installation:
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip config set global.trusted-host mirrors.aliyun.com
pip config set global.timeout 120
2.2.2 Installing Pillow
Install Pillow with:
pip install pillow
The installation compiles the C/C++ sources inside the Pillow package, which is slow, so be patient.
If the Pillow installation fails with "The headers or library files could not be found for jpeg", install the libjpeg development library first:
sudo apt-get install libjpeg-dev zlib1g-dev
2.3 Benchmark commands
Following the "Run on x86" section of the documentation, the keyword benchmark is run on an x86 PC with:
make -f tensorflow/lite/micro/tools/make/Makefile run_keyword_benchmark
The person detection benchmark is run on a PC with:
make -f tensorflow/lite/micro/tools/make/Makefile run_person_detection_benchmark
Each of these commands performs the following steps in order:
- invokes several download scripts to fetch dependency libraries and data sets;
- builds the test program;
- runs the test program.
In tensorflow/lite/micro/tools/make/Makefile, you can see the download scripts being invoked:
On first run, flatbuffers_download.sh and kissfft_download.sh download the corresponding archives and unpack them; see the scripts themselves for details.
pigweed_download.sh clones a repository and then checks out a specific revision.
Note that the repository https://pigweed.googlesource.com/pigweed/pigweed is generally unreachable from mainland China (the googlesource.com domain is blocked). Changing this URL to my pre-cloned mirror, https://github.com/xusiwei/pigweed.git, works around the failure to download the pigweed test data.
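One way to apply the URL swap is a quick sed edit of the download script. The script path below matches the repository layout I used and may differ in newer checkouts; the mirror URL is my own clone, so substitute your own if you prefer:

```shell
# Replace the unreachable googlesource.com URL with the GitHub mirror.
SCRIPT=tensorflow/lite/micro/tools/make/pigweed_download.sh
sed -i 's|https://pigweed.googlesource.com/pigweed/pigweed|https://github.com/xusiwei/pigweed.git|' "$SCRIPT"
grep 'github.com/xusiwei/pigweed' "$SCRIPT"   # should print the patched line
```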
2.4 Benchmark build rules
tensorflow/lite/micro/tools/make/Makefile is the top-level Makefile. It defines a number of make macro functions and pulls in other files via include, among them tensorflow/lite/micro/benchmarks/Makefile.inc, which defines the build rules for the benchmarks:
KEYWORD_BENCHMARK_SRCS := \
tensorflow/lite/micro/benchmarks/keyword_benchmark.cc

KEYWORD_BENCHMARK_GENERATOR_INPUTS := \
tensorflow/lite/micro/models/keyword_scrambled.tflite

KEYWORD_BENCHMARK_HDRS := \
tensorflow/lite/micro/benchmarks/micro_benchmark.h

KEYWORD_BENCHMARK_8BIT_SRCS := \
tensorflow/lite/micro/benchmarks/keyword_benchmark_8bit.cc

KEYWORD_BENCHMARK_8BIT_GENERATOR_INPUTS := \
tensorflow/lite/micro/models/keyword_scrambled_8bit.tflite

KEYWORD_BENCHMARK_8BIT_HDRS := \
tensorflow/lite/micro/benchmarks/micro_benchmark.h

PERSON_DETECTION_BENCHMARK_SRCS := \
tensorflow/lite/micro/benchmarks/person_detection_benchmark.cc

PERSON_DETECTION_BENCHMARK_GENERATOR_INPUTS := \
tensorflow/lite/micro/examples/person_detection/testdata/person.bmp \
tensorflow/lite/micro/examples/person_detection/testdata/no_person.bmp

ifneq ($(CO_PROCESSOR),ethos_u)
PERSON_DETECTION_BENCHMARK_GENERATOR_INPUTS += \
tensorflow/lite/micro/models/person_detect.tflite
else
PERSON_DETECTION_BENCHMARK_SRCS += \
$(GENERATED_SRCS_DIR)tensorflow/lite/micro/models/person_detect_model_data_vela.cc
endif

PERSON_DETECTION_BENCHMARK_HDRS := \
tensorflow/lite/micro/examples/person_detection/model_settings.h \
tensorflow/lite/micro/benchmarks/micro_benchmark.h

$(eval $(call microlite_test,keyword_benchmark,\
$(KEYWORD_BENCHMARK_SRCS),$(KEYWORD_BENCHMARK_HDRS),$(KEYWORD_BENCHMARK_GENERATOR_INPUTS)))

$(eval $(call microlite_test,keyword_benchmark_8bit,\
$(KEYWORD_BENCHMARK_8BIT_SRCS),$(KEYWORD_BENCHMARK_8BIT_HDRS),$(KEYWORD_BENCHMARK_8BIT_GENERATOR_INPUTS)))

$(eval $(call microlite_test,person_detection_benchmark,\
$(PERSON_DETECTION_BENCHMARK_SRCS),$(PERSON_DETECTION_BENCHMARK_HDRS),$(PERSON_DETECTION_BENCHMARK_GENERATOR_INPUTS)))
This shows that there are actually three benchmark programs, one more than the documentation mentions: keyword_benchmark_8bit, presumably an 8-bit quantized variant of keyword_benchmark. It also references three .tflite model files.
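You can list these registrations yourself by grepping for the microlite_test calls; as far as I can tell from the Makefile helpers, each call also generates a matching run_&lt;name&gt; target, which is where run_keyword_benchmark comes from:

```shell
# List the benchmark registrations in the build rules file; each
# microlite_test call defines one benchmark build (and run_*) target.
grep -n 'call microlite_test' tensorflow/lite/micro/benchmarks/Makefile.inc
```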
2.5 Keyword benchmark
The keyword benchmark uses a small model, well suited to MCUs clocked below 100 MHz, such as the STM32 F3/F4 series.
Because the model is small and cheap to compute, the benchmark finishes very quickly on a PC:
As you can see, keyword spotting runs very fast on a PC; 10 iterations take less than 1 millisecond.
The model file is ./tensorflow/lite/micro/models/keyword_scrambled.tflite
The model structure can be inspected with Netron, as shown below:
2.6 Person detection benchmark
The person detection benchmark is computationally heavier and takes longer to run:
xu@VirtualBox:~/opensource/tflite-micro$ make -f tensorflow/lite/micro/tools/make/Makefile run_person_detection_benchmark
tensorflow/lite/micro/tools/make/downloads/flatbuffers already exists, skipping the download.
tensorflow/lite/micro/tools/make/downloads/kissfft already exists, skipping the download.
tensorflow/lite/micro/tools/make/downloads/pigweed already exists, skipping the download.
[several long g++ compile and link command lines omitted]
tensorflow/lite/micro/tools/make/gen/linux_x86_64_default/bin/person_detection_benchmark non_test_binary linux
InitializeBenchmarkRunner took 192 ticks (0 ms).
WithPersonDataIterations(1) took 32299 ticks (32 ms)
DEPTHWISE_CONV_2D took 895 ticks (0 ms).
DEPTHWISE_CONV_2D took 895 ticks (0 ms).
CONV_2D took 1801 ticks (1 ms).
DEPTHWISE_CONV_2D took 424 ticks (0 ms).
CONV_2D took 1465 ticks (1 ms).
DEPTHWISE_CONV_2D took 921 ticks (0 ms).
CONV_2D took 2725 ticks (2 ms).
DEPTHWISE_CONV_2D took 206 ticks (0 ms).
CONV_2D took 1367 ticks (1 ms).
DEPTHWISE_CONV_2D took 423 ticks (0 ms).
CONV_2D took 2540 ticks (2 ms).
DEPTHWISE_CONV_2D took 102 ticks (0 ms).
CONV_2D took 1265 ticks (1 ms).
DEPTHWISE_CONV_2D took 205 ticks (0 ms).
CONV_2D took 2449 ticks (2 ms).
DEPTHWISE_CONV_2D took 204 ticks (0 ms).
CONV_2D took 2449 ticks (2 ms).
DEPTHWISE_CONV_2D took 243 ticks (0 ms).
CONV_2D took 2483 ticks (2 ms).
DEPTHWISE_CONV_2D took 202 ticks (0 ms).
CONV_2D took 2481 ticks (2 ms).
DEPTHWISE_CONV_2D took 203 ticks (0 ms).
CONV_2D took 2489 ticks (2 ms).
DEPTHWISE_CONV_2D took 52 ticks (0 ms).
CONV_2D took 1222 ticks (1 ms).
DEPTHWISE_CONV_2D took 90 ticks (0 ms).
CONV_2D took 2485 ticks (2 ms).
AVERAGE_POOL_2D took 8 ticks (0 ms).
CONV_2D took 3 ticks (0 ms).
RESHAPE took 0 ticks (0 ms).
SOFTMAX took 2 ticks (0 ms).
NoPersonDataIterations(1) took 32148 ticks (32 ms)
DEPTHWISE_CONV_2D took 906 ticks (0 ms).
DEPTHWISE_CONV_2D took 924 ticks (0 ms).
CONV_2D took 1762 ticks (1 ms).
DEPTHWISE_CONV_2D took 446 ticks (0 ms).
CONV_2D took 1466 ticks (1 ms).
DEPTHWISE_CONV_2D took 897 ticks (0 ms).
CONV_2D took 2692 ticks (2 ms).
DEPTHWISE_CONV_2D took 209 ticks (0 ms).
CONV_2D took 1366 ticks (1 ms).
DEPTHWISE_CONV_2D took 427 ticks (0 ms).
CONV_2D took 2548 ticks (2 ms).
DEPTHWISE_CONV_2D took 102 ticks (0 ms).
CONV_2D took 1258 ticks (1 ms).
DEPTHWISE_CONV_2D took 208 ticks (0 ms).
CONV_2D took 2473 ticks (2 ms).
DEPTHWISE_CONV_2D took 210 ticks (0 ms).
CONV_2D took 2460 ticks (2 ms).
DEPTHWISE_CONV_2D took 203 ticks (0 ms).
CONV_2D took 2461 ticks (2 ms).
DEPTHWISE_CONV_2D took 230 ticks (0 ms).
CONV_2D took 2443 ticks (2 ms).
DEPTHWISE_CONV_2D took 203 ticks (0 ms).
CONV_2D took 2467 ticks (2 ms).
DEPTHWISE_CONV_2D took 51 ticks (0 ms).
CONV_2D took 1224 ticks (1 ms).
DEPTHWISE_CONV_2D took 89 ticks (0 ms).
CONV_2D took 2412 ticks (2 ms).
AVERAGE_POOL_2D took 7 ticks (0 ms).
CONV_2D took 2 ticks (0 ms).
RESHAPE took 0 ticks (0 ms).
SOFTMAX took 2 ticks (0 ms).
WithPersonDataIterations(10) took 326947 ticks (326 ms)
NoPersonDataIterations(10) took 352888 ticks (352 ms)
As you can see, ten runs of the person detection model take a bit over three hundred milliseconds, roughly thirty-odd milliseconds per inference. This result was measured in a Linux virtual machine on a Windows 10 host with an AMD Ryzen 7 4800-series (standard-voltage) CPU.
The model file is ./tensorflow/lite/micro/models/person_detect.tflite
Again, the model structure can be viewed with Netron, as shown below:
3. Cross-Compiling TFLM
The previous sections covered building TFLM and running its benchmarks on a PC. Since everything was compiled and run natively, the resulting executables and libraries target x86.
To produce LoongArch libraries and executables, we need to cross-compile.
3.1 Setting up the loongarch64-linux-gnu-gcc environment
Setting up the loongarch64-linux-gnu-gcc environment is straightforward; essentially:
- unpack the Loongson cross-toolchain archive;
- add the toolchain's directory to the PATH environment variable.
The Loongson board manual describes these steps in detail, so they are not repeated here.
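The two steps above amount to something like the following; the /opt extraction path is an example of my own, so use wherever you actually unpacked the archive:

```shell
# Make the cross tools visible in the current shell; append this line to
# ~/.bashrc to make the change permanent.
export PATH=/opt/loongarch64-linux-gnu/bin:$PATH
```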
Once configured, verify the setup with:
loongarch64-linux-gnu-gcc -v
If version information is printed, the setup is correct.
3.2 Cross-compiling the keyword benchmark
In the PC build environment from earlier, first make sure the x86 build has succeeded at least once, i.e. that the following command has completed successfully:
make -f tensorflow/lite/micro/tools/make/Makefile run_keyword_benchmark
Next, we can build the LoongArch keyword_benchmark executable with:
MKFLAGS="-f tensorflow/lite/micro/tools/make/Makefile"
make $MKFLAGS clean
MKFLAGS="$MKFLAGS CC=loongarch64-linux-gnu-gcc"
MKFLAGS="$MKFLAGS CXX=loongarch64-linux-gnu-g++"
make $MKFLAGS keyword_benchmark -j8
When the build finishes, the executable is generated under the gen/linux_x86_64_default/bin directory.
The following command confirms that the output is a LoongArch executable:
As you can see, the Machine field reads LoongArch.
3.3 Cross-compiling the person detection benchmark
Similarly, we can build the LoongArch person_detection_benchmark executable with:
MKFLAGS="-f tensorflow/lite/micro/tools/make/Makefile"
MKFLAGS="$MKFLAGS CC=loongarch64-linux-gnu-gcc"
MKFLAGS="$MKFLAGS CXX=loongarch64-linux-gnu-g++"
make $MKFLAGS person_detection_benchmark -j8
When the build finishes, the executable is generated under the gen/linux_x86_64_default/bin directory.
As before, the same command confirms that the output is a LoongArch executable.
4. Running the TFLM Benchmarks on the Loongson Board
4.1 Copying the executables to the board
The executables can be transferred to the Loongson 2K0500 board via a USB drive, FTP, or a similar method.
4.2 Keyword benchmark on the Loongson board
The keyword benchmark on the Loongson 2K0500 board produces the following results:
4.3 Person detection benchmark on the Loongson board
The person detection benchmark on the Loongson 2K0500 board produces the following results:
On the Loongson 2K0500 board, ten consecutive runs of the person detection model on the image with a person take 4991 ms in total, or 499.1 ms per run on average; on the image without a person, ten runs take 4990 ms, or 499 ms per run on average.
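For reference, the board-side invocation is just running the copied binary from wherever you placed it; wrapping it in `time` gives a coarse cross-check of the timing the benchmark itself prints:

```shell
# Run the person detection benchmark on the board and time the whole run.
chmod +x ./person_detection_benchmark
time ./person_detection_benchmark
```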
4.4 Running the TFLM benchmark on a Raspberry Pi 3B+
I also have a long-idle Raspberry Pi 3B+ on hand, so let's use it for comparison.
Since the keyword model's compute cost is tiny, I only ran the person detection model on it. The final results:
On the Raspberry Pi 3B+, ten consecutive runs of the person detection model on the image with a person take 4186 ms in total, or 418.6 ms per run on average; on the image without a person, ten runs take 4190 ms, or 419 ms per run on average.
4.5 Comparison with the Raspberry Pi 3B+ results
The TFLM benchmark results on the Loongson 2K0500 board and the Raspberry Pi 3B+ are summarized below:
| | Loongson 2K0500 board | Raspberry Pi 3B+ |
| --- | --- | --- |
| Avg. time, person image (ms) | 499.1 | 418.6 |
| Avg. time, no-person image (ms) | 499 | 419 |
| Max CPU clock | 600 MHz | 1.4 GHz |
The table shows that, for the TFLM person detection workload, per-inference time on the Loongson 2K0500 board is roughly on par with the Raspberry Pi 3B+.
Given that the Pi 3B+'s CPU clock is more than twice that of the Loongson 2K0500, this is a quite respectable result for the Loongson board. Beyond the CPUs, the two boards also differ in memory chips and interfaces: the Loongson 2K0500 board uses DDR3, while the Raspberry Pi 3B+ uses LPDDR2, which is somewhat slower.
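As a rough sanity check of that observation (ignoring memory and microarchitectural differences, so treat this as an estimate only), multiplying average latency by peak clock gives the CPU cycles spent per inference:

```shell
# ms * MHz / 1000 = millions of CPU cycles per inference (at peak clock).
awk 'BEGIN {
  printf "2K0500: %.0f Mcycles/inference\n", 499.1 * 600 / 1000
  printf "Pi 3B+: %.0f Mcycles/inference\n", 418.6 * 1400 / 1000
}'
```

By this crude measure the 2K0500 retires roughly half as many cycles per inference as the Pi 3B+, consistent with it doing more work per clock on this workload.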
5. References
- TFLM project repository: https://github.com/tensorflow/tflite-micro
- Netron model visualization tool downloads: https://github.com/lutzroeder/netron/releases
- Loongson GNU toolchain download page: http://www.loongnix.cn/zh/toolchain/GNU/