stm32f7: Large performance difference between stm32f746 and stm32f767 #14728

@bergzand

Description

The stm32f746 and stm32f767 are almost identical: they differ only in their peripherals and cache size, so one would expect at least roughly similar benchmark results from the two cores. However, tests/bitarithm_timings shows widely different results.

nucleo-f746zg
2020-08-07 15:20:21,461 # START
2020-08-07 15:20:21,467 # main(): This is RIOT! (Version: 2020.10-devel-596-g01e6b-HEAD)
2020-08-07 15:20:21,468 # Start.
2020-08-07 15:20:26,473 # + bitarithm_msb: 4529488 iterations per second
2020-08-07 15:20:31,476 # + bitarithm_lsb: 3793632 iterations per second
2020-08-07 15:20:36,481 # + bitarithm_bits_set: 2145251 iterations per second
2020-08-07 15:20:41,486 # + bitarithm_test_and_clear: 776978 iterations per second
2020-08-07 15:20:41,487 # Done.

nucleo-f767zi
2020-08-07 15:18:58,026 # Help: Press s to start test, r to print it is ready
2020-08-07 15:18:58,027 # START
2020-08-07 15:18:58,032 # main(): This is RIOT! (Version: 2020.10-devel-596-g01e6b-HEAD)
2020-08-07 15:18:58,033 # Start.
2020-08-07 15:19:03,037 # + bitarithm_msb: 4023283 iterations per second
2020-08-07 15:19:08,041 # + bitarithm_lsb: 4601862 iterations per second
2020-08-07 15:19:13,047 # + bitarithm_bits_set: 395107 iterations per second
2020-08-07 15:19:18,051 # + bitarithm_test_and_clear: 1830507 iterations per second
2020-08-07 15:19:18,052 # Done.

Notably, bitarithm_msb is faster than bitarithm_lsb on the nucleo-f746zg, while on the nucleo-f767zi it is the other way around. This is odd, considering that both have the same instruction set and that inspecting the binaries shows identical code for these functions.
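To put numbers on the gap, the per-benchmark ratio between the two boards can be computed from the logs above. This is a minimal sketch: the benchmark lines are copied from the logs, while `parse_results` and the regular expression are illustrative helpers and not part of RIOT.

```python
import re

# Benchmark lines copied from the two logs above.
F746_LOG = """
2020-08-07 15:20:26,473 # + bitarithm_msb: 4529488 iterations per second
2020-08-07 15:20:31,476 # + bitarithm_lsb: 3793632 iterations per second
2020-08-07 15:20:36,481 # + bitarithm_bits_set: 2145251 iterations per second
2020-08-07 15:20:41,486 # + bitarithm_test_and_clear: 776978 iterations per second
"""

F767_LOG = """
2020-08-07 15:19:03,037 # + bitarithm_msb: 4023283 iterations per second
2020-08-07 15:19:08,041 # + bitarithm_lsb: 4601862 iterations per second
2020-08-07 15:19:13,047 # + bitarithm_bits_set: 395107 iterations per second
2020-08-07 15:19:18,051 # + bitarithm_test_and_clear: 1830507 iterations per second
"""

# Matches lines such as "+ bitarithm_msb: 4529488 iterations per second".
RESULT_RE = re.compile(r"\+ (\w+): (\d+) iterations per second")


def parse_results(log):
    """Map each benchmark name to its iterations per second (hypothetical helper)."""
    return {name: int(rate) for name, rate in RESULT_RE.findall(log)}


f746 = parse_results(F746_LOG)
f767 = parse_results(F767_LOG)

# Ratio above 1 means the nucleo-f767zi run was faster for that benchmark.
ratios = {name: f767[name] / f746[name] for name in f746}

for name, ratio in ratios.items():
    print(f"{name}: f767/f746 = {ratio:.2f}")
```

A ratio well below 1 for one benchmark and well above 1 for another, on cores running identical code for these functions, is what makes the result look like a layout effect and not a difference in raw core speed.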

After confirming this, I flashed the firmware built for the nucleo-f767zi onto the nucleo-f746zg. The two boards and cores are similar enough for this to work, and the other way around also works without any issues for this test application. In short, firmware built for board A shows the same performance on board B as it does on board A, and vice versa.

The compiled binaries for these two boards are almost identical. The only difference is the number of interrupts: 98 IRQ lines versus 110 IRQ lines. This shifts all function addresses slightly, so to easily compare the contents of the two firmware ELF files with each other, I removed the 12 extra IRQ lines on the stm32f767. With this change the two ELF files are almost identical; only two words differ. With only this difference remaining, I flashed the new binary on the boards, and voilà, the measurements match between the two boards. Changing the number of allocated IRQ handlers shifts all functions by a certain amount, which causes the different measurements. Increasing the number of handlers beyond what is useful also changes (and does not necessarily increase) the performance.
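The size of the shift can be estimated with a little arithmetic. The sketch below assumes that each vector table entry is one 32-bit address (4 bytes) and that the linker adds no extra padding after the table, so the real shift may differ. The 32-byte line size is the Cortex-M7 L1 cache line size; whether cache-line or flash-line alignment is the actual mechanism behind the different measurements is an assumption here, not something this issue confirms. The example function address is purely hypothetical.

```python
VECTOR_ENTRY_BYTES = 4   # assumption: one 32-bit handler address per IRQ line
CACHE_LINE_BYTES = 32    # Cortex-M7 L1 cache line size

IRQ_LINES_F746 = 98      # from the issue text
IRQ_LINES_F767 = 110     # from the issue text


def table_shift_bytes(irq_lines_a, irq_lines_b):
    """Bytes by which code after the vector table moves (ignoring linker padding)."""
    return (irq_lines_b - irq_lines_a) * VECTOR_ENTRY_BYTES


def line_offset(address, line_bytes=CACHE_LINE_BYTES):
    """Offset of an address within its cache line."""
    return address % line_bytes


shift = table_shift_bytes(IRQ_LINES_F746, IRQ_LINES_F767)

# Hypothetical function address in flash, for illustration only.
example_addr = 0x08000400

print(f"shift: {shift} bytes")
print(f"offset in line before: {line_offset(example_addr)}")
print(f"offset in line after:  {line_offset(example_addr + shift)}")
```

Because the estimated shift is not a multiple of the line size, a function that started on a line boundary in one build starts mid-line in the other, which would be consistent with identical code running at different speeds.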

TL;DR (spoilers)

TL;DR: Modifying the number of allocated IRQ handlers changes the measured performance of tests/bitarithm_timings

Steps to reproduce the issue

With tests/bitarithm_timings:

  • Test firmware for the nucleo-f746zg on the nucleo-f746zg and the nucleo-f767zi.
  • Test firmware for the nucleo-f767zi on the nucleo-f746zg and the nucleo-f767zi.

This should reproduce my numbers above.

Expected results

Identical measurements between the two MCUs

Actual results

Different results between the two MCUs

Metadata

Labels

Area: cpu, Platform: ARM, Type: bug
