STM32H7 + FreeRTOS: deploying ML models directly to Cortex-M7 without any RTOS bloat, lessons learned (Web and IoT)

Last updated on 2 months ago

automation

Kevin, Veteran Member

Posted 2 months ago

fw_engineer_v OP
4 days ago
I've been deploying TFLite Micro models on an STM32H7 (specifically the Nucleo-H743ZI board) running FreeRTOS for a wearable health monitor project. The H743 has 1 MB of RAM in total, split across several regions (128 KB DTCM, 512 KB AXI SRAM, 288 KB of D2 SRAM, plus some smaller blocks), and its Cortex-M7 runs at up to 480 MHz with a double-precision FPU and L1 caches, which makes it one of the strongest Cortex-M parts for inference work that doesn't need a full Linux system. The tricky part is that FreeRTOS task stacks, the FreeRTOS heap, and the TFLite Micro tensor arena all compete for the same RAM, and if you don't plan the memory map up front you'll get hard faults that are near-impossible to debug. My key lesson after two weeks of pain: put the tensor arena in DTCM, the 128 KB tightly-coupled memory wired directly to the M7 core. It's zero-wait-state, it's faster than the AXI and AHB SRAMs, and CPU accesses to it don't go through the bus matrix at all, so DMA traffic into the other SRAMs never contends with your inference workload. The trade-off is that DMA1 and DMA2 can't reach DTCM, so keep DMA buffers in one of the other regions.
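
One way to make "plan the memory map up front" concrete is to write the budget down as compile-time checks, so an oversized arena fails the build instead of faulting at runtime. This is only a sketch: the region sizes come from the H743 memory map discussed in this thread, but ARENA_SIZE, DTCM_OTHER_BYTES, FREERTOS_HEAP_BYTES, and region_headroom() are hypothetical names and numbers chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Region sizes from the H743 memory map discussed in this thread. */
#define DTCM_BYTES      (128u * 1024u)
#define AXI_SRAM_BYTES  (512u * 1024u)

/* Hypothetical budget: replace with the numbers for your model and tasks. */
#define ARENA_SIZE          (96u * 1024u)  /* TFLite Micro tensor arena      */
#define DTCM_OTHER_BYTES    (8u * 1024u)   /* anything else you keep in DTCM */
#define FREERTOS_HEAP_BYTES (64u * 1024u)  /* assumed to live in AXI SRAM    */

/* Fail the build if the plan does not fit the hardware. */
_Static_assert(ARENA_SIZE + DTCM_OTHER_BYTES <= DTCM_BYTES,
               "tensor arena plus other DTCM data exceeds 128 KB");
_Static_assert(FREERTOS_HEAP_BYTES <= AXI_SRAM_BYTES,
               "FreeRTOS heap exceeds AXI SRAM");

/* Bytes still free in a region after used_bytes have been allocated. */
static size_t region_headroom(size_t region_bytes, size_t used_bytes)
{
    assert(used_bytes <= region_bytes);
    return region_bytes - used_bytes;
}
```

With these example numbers, region_headroom(DTCM_BYTES, ARENA_SIZE + DTCM_OTHER_BYTES) leaves 24 KB of DTCM spare.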

bare_metal_bn
3 days ago
The DTCM placement trick is genuinely underused, and ST's introductory tutorials don't make much of it. Here's how you place the tensor arena in DTCM via the GCC linker script for the H743:

/* STM32H743 GCC linker script excerpt */
MEMORY {
  FLASH   (rx)  : ORIGIN = 0x08000000, LENGTH = 2048K
  DTCMRAM (xrw) : ORIGIN = 0x20000000, LENGTH = 128K
  RAM     (xrw) : ORIGIN = 0x24000000, LENGTH = 512K   /* AXI SRAM (D1) */
  RAM_D2  (xrw) : ORIGIN = 0x30000000, LENGTH = 288K   /* SRAM1-3 (D2) */
}

SECTIONS {
  /* ... existing sections ... */

  /* TFLite Micro tensor arena -> DTCM (zero wait state).
     NOLOAD: the section has no load image in flash. */
  .tflite_arena (NOLOAD) : {
    . = ALIGN(16);
    _tflite_arena_start = .;
    KEEP(*(.tflite_arena))
    _tflite_arena_end = .;
  } > DTCMRAM
}
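Then in your C source:

    __attribute__((section(".tflite_arena"), aligned(16))) static uint8_t tensor_arena[ARENA_SIZE];

That gives you zero-wait-state access to everything that lives in the arena (activations and scratch buffers); the weights are still read from flash unless you copy them out. I measured a 22% inference speedup versus placing the arena in SRAM1 on a 1D-CNN model. Keep the 16-byte alignment: TFLite Micro's own examples declare the arena 16-byte aligned. It is not a NEON requirement, since the Cortex-M7 has no NEON (that's a Cortex-A feature). Problems caused by an arena that isn't aligned the way the runtime expects are extremely hard to track down.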
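
A cheap startup sanity check can confirm that the arena really ended up aligned and inside DTCM. The sketch below assumes the .tflite_arena section from the linker excerpt above; ARENA_ATTR, arena_is_aligned(), buffer_in_region(), and the 96 KB ARENA_SIZE are hypothetical names and values for illustration. The host fallback exists only so the file still compiles off-target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (96u * 1024u)  /* hypothetical; size it for your model */

#if defined(__arm__)
/* On target: put the arena in the .tflite_arena output section (DTCM). */
#define ARENA_ATTR __attribute__((section(".tflite_arena"), aligned(16)))
#else
/* Host build: keep only the alignment so the code still compiles. */
#define ARENA_ATTR __attribute__((aligned(16)))
#endif

ARENA_ATTR static uint8_t tensor_arena[ARENA_SIZE];

/* Returns 1 if the arena start address is 16-byte aligned. */
static int arena_is_aligned(void)
{
    return ((uintptr_t)tensor_arena % 16u) == 0u;
}

/* Returns 1 if [buf, buf + len) lies entirely inside the given region. */
static int buffer_in_region(const void *buf, size_t len,
                            uintptr_t region_start, size_t region_len)
{
    uintptr_t p = (uintptr_t)buf;
    return p >= region_start
        && len <= region_len
        && (p - region_start) <= region_len - len;
}
```

On the target you could call assert(arena_is_aligned()) and assert(buffer_in_region(tensor_arena, sizeof tensor_arena, 0x20000000u, 128u * 1024u)) once at boot, before handing the arena to the interpreter.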

otto_embedded
2 days ago
ST has its own tool, STM32Cube.AI, shipped as the X-CUBE-AI expansion pack for STM32CubeMX. It takes Keras, TFLite, or ONNX models and generates C code that calls optimized kernels. It's a fundamentally different approach from TFLite Micro (ahead-of-time code generation instead of a runtime interpreter), and in my testing on the H7 it's consistently 15–30% faster for the common layer types: Conv2D, depthwise-separable convolution, fully connected, and pooling. The generated network code is readable C, which helps when you need to audit it for safety-critical applications like wearable health monitors, although as far as I know the kernel runtime it links against ships as a precompiled library, so the audit only goes so far. The tooling is tied to the STM32Cube ecosystem, which is a heavy install and annoying if you don't already use it, but if you're targeting ST silicon specifically the runtime performance is genuinely hard to argue with.

maya_hw_hacks
1 day ago
FreeRTOS task design tip for inference workloads: run your ML inference in a dedicated task at osPriorityNormal (that's the CMSIS-RTOS wrapper name; with native xTaskCreate() pick a mid-range priority) and give it a large stack: minimum 8 KB, ideally 16 KB, because TFLite Micro ops go through deep call chains with sizeable stack frames. Note that xTaskCreate() takes the stack depth in words, not bytes. Use a FreeRTOS queue to decouple sensor data collection from inference entirely: the sensor ISR (via xQueueSendFromISR()) or a high-priority data collection task pushes raw samples into the queue, and the inference task blocks on that queue and processes whenever data arrives. As long as the queue is deep enough to cover one inference run, you never lose a sensor reading while inference is running. Critical: don't wrap the full inference call in a taskENTER_CRITICAL() section, because you don't want to mask interrupts for the 10–50 ms that inference takes on the H7. Structure it so only the result handoff to downstream tasks needs a brief critical section. That one design decision is the difference between a stable deployment and an inexplicable watchdog reset every few hours.
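
The sizing arithmetic behind this advice can be sketched in plain C. These helpers only compute numbers and do not call FreeRTOS; min_queue_depth(), stack_bytes_to_words(), and the 400 Hz sample rate in the usage note are hypothetical, chosen for illustration. The result is a bare minimum, so add headroom for scheduling jitter.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum queue depth (in samples) needed so that no sample is dropped
 * while one inference run of inference_ms is in progress.
 * Computes ceil(sample_hz * inference_ms / 1000). */
static uint32_t min_queue_depth(uint32_t sample_hz, uint32_t inference_ms)
{
    uint64_t n = (uint64_t)sample_hz * inference_ms;
    return (uint32_t)((n + 999u) / 1000u);
}

/* Convert a stack size in bytes to the word count that xTaskCreate()
 * expects. word_bytes is sizeof(StackType_t), which is 4 on Cortex-M ports.
 * Rounds up so the stack is never smaller than requested. */
static uint32_t stack_bytes_to_words(uint32_t bytes, uint32_t word_bytes)
{
    assert(word_bytes != 0u);
    return (bytes + word_bytes - 1u) / word_bytes;
}
```

For example, a sensor sampled at 400 Hz with a worst-case 50 ms inference needs a queue of at least min_queue_depth(400, 50) = 20 samples, and a 16 KB stack is stack_bytes_to_words(16 * 1024, 4) = 4096 words.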
