@kaspar030
Contributor

kaspar030 commented Apr 9, 2018

Contribution description

This is a shot at providing a secure authenticated over-the-air firmware update mechanism based on tweetnacl and CoAP.

Please see the included documentation in "sys/include/firmware.h" for more details including a description of the architecture.

This PR includes work from @kYc0o.

Currently only configured for and tested on samr21-xpro.

Issues/PRs references

Waiting for #8772 and #8788 for the necessary CoAP infrastructure.
Supersedes #7389, #7396, #7457.

@kaspar030 added the labels Type: enhancement, Platform: ARM, Area: build system, State: waiting for other PR, Area: security, Area: OTA, and CI: ready for build on Apr 9, 2018
@kaspar030 changed the title from "Ota firmware updates" to "sys: provide secure over-the-air update mechanism" on Apr 9, 2018
@kYc0o self-assigned this on Apr 9, 2018
@waehlisch
Member

I'm not sure if "secure" is the right term based on the current implementation. The proposal just verifies the integrity of the received image. Verification of integrity is only one aspect in the context of security.

Why should the RIOT node trust the public key?

The current proposal does not allow for downgrade of a firmware image. Is this useful?

@jnohlgard
Member

Did you look at other implementations and proposals, such as the standardization work being done by the IETF SUIT WG (https://datatracker.ietf.org/group/suit/about/)?

@kaspar030
Contributor Author

I'm not sure if "secure" is the right term based on the current implementation. The proposal just verifies the integrity of the received image. Verification of integrity is only one aspect in the context of security.

Apart from the integrity check, through checking the signature, updates are authenticated, as only the owner of the public key can issue a valid update request.

There are many threat vectors that are not covered by this simple mechanism, but IMO it provides enough security to actually be usable.

Why should the RIOT node trust the public key?

Why should any node trust any key?

The current proposal does not allow for downgrade of a firmware image. Is this useful?

That is a feature that can be added later, if needed.

@kaspar030
Contributor Author

Did you look at other implementations and proposals, such as the standardization work being done by the IETF SUIT WG (https://datatracker.ietf.org/group/suit/about/)?

Yes, I've been watching both the MCUboot development and SUIT's draft.
I think our smallest devices might need something less complex.

@kaspar030
Contributor Author

That is a feature that can be added later, if needed.

Also, downgrading is always possible by sending a newly signed update with an old version but higher image version.

@waehlisch
Member

waehlisch commented Apr 9, 2018

I'm not sure if "secure" is the right term based on the current implementation. The proposal just verifies the integrity of the received image. Verification of integrity is only one aspect in the context of security.

Apart from the integrity check, through checking the signature, updates are authenticated, as only the owner of the public key can issue a valid update request.

Do you want to say that only the owner of the private key that belongs to the public key can issue a valid signature? Still, as long as I cannot verify that the public key belongs to the right entity, the signature is useless.

There are many threat vectors that are not covered by this simple mechanism, but IMO it provides enough security to actually be usable.

You mean lines 81-89 in sys/include/firmware.h discuss the covered threat vectors? IMO I would not consider this as usable security.

Why should the RIOT node trust the public key?

Why should any node trust any key?

Why do you use signatures?

The current proposal does not allow for downgrade of a firmware image. Is this useful?

That is a feature that can be added later, if needed.

No, the question was whether this limitation leads to problems when deployed.

@waehlisch
Member

That is a feature that can be added later, if needed.

Also, downgrading is always possible by sending a newly signed update with an old version but higher image version.

The term "version" seems semantically overloaded: version of firmware and some kind of signing version (which is more or less a timestamp).

Anyhow, your approach does not allow for the following scenario: A user gets a valid firmware image from a third party. The user updates the IoT device with a new version and then decides that he liked the previous version more. Downgrading to the previous version is not possible without interacting with the third party.

@waehlisch
Member

Btw, looking into the description of the commits, I'm wondering why most of them are submitted under the PR caption "sys: provide secure over-the-air update mechanism".

Many of the commits are independent of OTA, e.g., "unittests/tests-nanocoap: add coap_get_uri()" test or "nanocoap: add server-side block1 support".

Just wondering.

@kaspar030 changed the title from "sys: provide secure over-the-air update mechanism" to "sys: provide over-the-air update mechanism" on Apr 10, 2018
@kaspar030
Contributor Author

Anyhow, your approach does not allow for the following scenario

The approach considers the owner of the private key to be the user, and that user can thus downgrade. Downgrading by just making the bootloader boot the previous image is currently not supported, but as said, easily added (one flash page erase call).

@waehlisch I think there's a misunderstanding here. This PR does not try to present a 100% secure firmware upgrade mechanism that fits every use case any hypothetical concept can.
The idea is to get building blocks in place that allow implementation of remotely upgradable systems. Ideally, the concept allows deployment of an improved mechanism to nodes in the field.

Many aspects of remote upgrading, e.g., policies on when to initiate a necessary reboot or multiple stakeholders, have been left out in order to reduce complexity, to get something usable in code, and to gain experience in actually implementing over-the-air upgrades. Getting this PR to where it is now has already taught a couple of lessons that high-level discussions just ignore.
Also, I actually do have customers that cannot wait for SUIT implementations.

So instead of bashing what the mechanism presented here cannot do, I'd rather concentrate on what it can do.

I've removed the security claims from the PR title and description and will adapt the documentation for now. Better to provide certain security features and not claim to do so than claim them and not deliver on expectations.

@kaspar030
Contributor Author

Many of the commits are independent of OTA

As stated, this PR depends on two CoAP PRs for blockwise transfer support. The commit list shows those commits, too.

@emmanuelsearch
Member

emmanuelsearch commented Apr 10, 2018

@waehlisch @kaspar030 let's not get over-excited about the term "security" here.
Although minimal in many aspects, the PR provides a solid base for an OTA prototype to be tested, assessed, and further developed. In particular:

  • the simplistic firmware metadata used here can be upgraded to SUIT standard compliance down the road;
  • in this basic approach the public key of the authorized software maintainer is pre-provisioned on the device at commissioning time. This provides basic, well-known security guarantees and limitations;
  • in this basic approach, a single software maintainer is authorized to update the software on the IoT device (no 3rd parties);
  • though somewhat awkward, a (controlled) rollback is possible with this basic approach: the authorized software maintainer compiles the older version of the firmware, and assigns a new version number then pushes it to the device.

Long story short: this base clearly has value. Let's work together on improving and extending it (either in this PR or in subsequent PRs).

@waehlisch
Member

waehlisch commented Apr 10, 2018

@kaspar030 it is not about bashing, the discussion is about getting things clear.

@emmanuelsearch

in this basic approach, a single software maintainer is authorized to update the software on the IoT device (no 3rd parties);

As far as I understand, anyone can update the software on the IoT device because there is no authentication of the client who sends the update. The only thing that is needed is a valid firmware image (i.e., higher version number and signed by the private key that relates to the public key stored on the node/previous version) and connectivity.

@emmanuelsearch
Member

@waehlisch

anyone can update the software on the IoT device because there is no authentication of the client who sends the update

You mean, this would be solved by using DTLS?

@bergzand
Member

anyone can update the software on the IoT device because there is no authentication of the client who sends the update

You mean, this would be solved by using DTLS?

The way COSE+SUIT solve this is by requiring a correctly signed manifest to ensure that the manifest is generated by an authorized party and to require a manifest version number to be strictly monotonic to prevent replay attacks with an old manifest by an attacker.

From what I read from the comments in the code, an identical construction is also in place here.
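
The acceptance rule described above (valid signature plus a strictly increasing version) can be sketched in C as follows. This is only an illustration: the struct and function names are hypothetical and not taken from this PR, and the signature check is assumed to have been done already (for example with tweetnacl), with its result passed in as a flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified metadata; the PR's real firmware_metadata_t
 * may look different. */
typedef struct {
    uint32_t version;   /* monotonically increasing image version */
} sketch_metadata_t;

/*
 * Accept an update only if its signature verified AND its version is
 * strictly greater than the running one. The version check is what
 * rejects a replayed manifest that is validly signed but old.
 */
static bool sketch_update_acceptable(const sketch_metadata_t *running,
                                     const sketch_metadata_t *candidate,
                                     bool signature_ok)
{
    if (!signature_ok) {
        return false;
    }
    return candidate->version > running->version;
}
```

Under this rule a rollback needs the key owner to re-sign the older firmware with a higher version value, which matches the "controlled rollback" described earlier in the thread.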

@@ -1,5 +1,6 @@
/*
* Copyright (C) 2017 Inria
* Copyright (C) 2018 Kaspar Schleiser <[email protected]>
Contributor

Since the original author is @kYc0o, I would leave the 2017 Inria copyright on top and put your personal under. I saw that in other places.

Contributor

@aabadie left a comment

A first pass at code review. I skipped the nanocoap parts. In the end, I have quite a few comments.

{
rom (rx) : ORIGIN = _rom_start_addr + _boot_offset, LENGTH = _rom_length - _boot_offset
ram (rwx) : ORIGIN = _ram_start_addr, LENGTH = _ram_length
rom (rx) : ORIGIN = _rom_start_addr + _rom_offset, LENGTH = _rom_length
Contributor

@aabadie commented Apr 10, 2018

Why remove the _boot_offset (now _rom_offset) from the ROM length?

Contributor

This will break current support for pre-flashed bootloaders: you might end up flashing a larger image than the space available. @kaspar030 what's your use case here?
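
The concern can be checked with simple arithmetic: if the region's ORIGIN is shifted by an offset but its LENGTH is not reduced by the same amount, the region extends past the end of flash. A small C sketch of that check, with a hypothetical helper name and illustrative numbers rather than values from this PR:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the region [origin, origin + length) lies entirely
 * within the flash area [rom_start, rom_start + rom_length).
 */
static bool sketch_region_fits(uint32_t rom_start, uint32_t rom_length,
                               uint32_t origin, uint32_t length)
{
    /* 64-bit sums avoid wrap-around near the top of the address space */
    uint64_t region_end = (uint64_t)origin + length;
    uint64_t flash_end = (uint64_t)rom_start + rom_length;

    return origin >= rom_start && region_end <= flash_end;
}
```

For example, with 0x40000 bytes of flash starting at 0 and an offset of 0x1000, a region of length 0x40000 - 0x1000 fits, whereas a region of the full length 0x40000 overruns the flash by 0x1000 bytes.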

@@ -0,0 +1,61 @@
/*
Contributor

What about simply calling the parent directory bootloader ?

Contributor Author

are you suggesting to change the name "riotboot"? there might be other bootloaders coming.

Contributor

I still think the bootloader should be at the root of RIOT. Even if other bootloaders come, we'll have a very hard time integrating them into the base code; at best, they'll come as packages. Any strong reason/motivation to keep it in dist? It's not even possible to compile it/use it outside RIOT...

firmware_jump_to_slot(slot);
}

/* serious trouble! */
Contributor

Why ?

Contributor Author

I hope no one ever finds out.

Contributor

If all the conditions to boot an image are not met, the bootloader boots nothing, which indeed, is serious trouble...

Contributor

Thanks for the clarification; I think it should replace the 'serious trouble' comment. But then, in this case, why is the endless while loop required?

Contributor Author

It is intended to provide a defined error behaviour.
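
To illustrate the behaviour discussed in this thread, here is a rough sketch of a slot-selection routine. It assumes the bootloader prefers the valid slot with the highest version, which is an assumption for illustration; the names are hypothetical and this is not the riotboot code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-slot summary produced by the bootloader's checks. */
typedef struct {
    bool valid;         /* metadata and image passed validation */
    uint32_t version;   /* image version taken from the metadata */
} sketch_slot_t;

/*
 * Returns the index of the valid slot with the highest version,
 * or -1 if no slot is bootable. On equal versions the lower index wins.
 */
static int sketch_choose_slot(const sketch_slot_t *slots, size_t num_slots)
{
    int best = -1;

    for (size_t i = 0; i < num_slots; i++) {
        if (!slots[i].valid) {
            continue;
        }
        if (best < 0 || slots[i].version > slots[(size_t)best].version) {
            best = (int)i;
        }
    }
    return best;
}
```

A bootloader built around such a routine would jump to the returned slot when the result is non-negative, and otherwise end in a deliberate endless loop so that the failure case has defined behaviour.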

extern "C" {
#endif

off_t fsize(const char *filename);
Contributor

Undocumented functions

Contributor Author

internal functions

Contributor

@kYc0o commented Apr 11, 2018

@kaspar030 you could still prefix them with _, couldn't you? Though that would look strange in a header... I'd say if they are internal we could move them elsewhere. Isn't doxygen complaining, btw?

return 1;
}

crypto_sign_keypair(pk,sk);
Contributor

missing space after ,

Contributor Author

fixed

*/
void firmware_dump_slot_addrs(void);

extern const unsigned firmware_num_slots;
Contributor

Not documented (I'll check later what it is for)

Contributor Author

done

* @{
*
* @file
* @brief Firmware update cia CoAP implementation
Contributor

s/cia/via

Contributor Author

fixed

coap_block1_t block1;
int blockwise = coap_get_block1(pkt, &block1);

LOG_INFO("ota: received bytes %u-%u", \
Contributor

\ is not needed

Contributor Author

fixed

}
else {
if (block1.offset == _state.offset) {
res = firmware_update_putbytes(&_state, block1.offset, pkt->payload, pkt->payload_len);
Contributor

This line is a bit long and could be split

Contributor Author

ok

CPU_FLASH_BASE + SLOT0_SIZE + SLOT1_SIZE
};

const unsigned firmware_num_slots = sizeof(_firmware_slot_start)/sizeof(unsigned);
Contributor

Ok, it's here. Do you really need to retrieve (using extern) it in sys/firmware.h ?

Contributor Author

yes, riotboot needs it in its main().
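
For readers following along, the pattern under discussion can be sketched like this. `sizeof` only yields the array size where the array definition is visible, so the count is computed next to the table and exposed to other files through an `extern` declaration. All names and values below are illustrative, not the PR's actual layout.

```c
#include <stddef.h>

/* Illustrative layout values, not taken from the PR. */
#define SKETCH_FLASH_BASE  0x00000000u
#define SKETCH_SLOT0_SIZE  0x1000u
#define SKETCH_SLOT1_SIZE  0x1f800u

/* Start address of each slot, derived from the sizes of the slots before it. */
static const unsigned sketch_slot_start[] = {
    SKETCH_FLASH_BASE,
    SKETCH_FLASH_BASE + SKETCH_SLOT0_SIZE,
    SKETCH_FLASH_BASE + SKETCH_SLOT0_SIZE + SKETCH_SLOT1_SIZE
};

/* Dividing by the element size keeps the count correct even if the
 * element type of the table changes later. */
const unsigned sketch_num_slots =
    sizeof(sketch_slot_start) / sizeof(sketch_slot_start[0]);
```

A header would then carry only `extern const unsigned sketch_num_slots;`, which is how another translation unit such as a bootloader's main() could iterate over the slots without seeing the table itself.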

@kaspar030 force-pushed the ota_firmware_updates branch from c99c070 to 99468d8 on April 10, 2018 21:32

- flash image and bootloader

$ "BOARD=samr21-xpro APP_VER=$(data +%s) make -j4 riotboot/flash
Member

date ?

Contributor Author

yup, fixed

@kaspar030 force-pushed the ota_firmware_updates branch from 99468d8 to 5f79ae7 on April 11, 2018 10:44
* @param[in] pk NaCL public signing key to use
*
*/
int firmware_validate_metadata_signature(firmware_metadata_t *metadata, const unsigned char *pk);
Contributor

I wanted to know what the return value means but unfortunately it's not documented.

@kaspar030 force-pushed the ota_firmware_updates branch from 558ccb6 to 714ba81 on June 1, 2018 13:39
@OYTIS
Contributor

OYTIS commented Jul 19, 2018

Hi all!
Do you need some new hands to make this PR ready? I would be interested in having at least bootloader/flasher part in.

@kYc0o @bergzand @emmanuelsearch @aabadie

@danpetry
Contributor

Hi @OYTIS and welcome! We are breaking down this PR and merging it in a set of smaller PRs. Please see #9342 for the tracking issue. And of course, any help you are willing to give would be great!

@OYTIS
Contributor

OYTIS commented Jul 22, 2018

@danpetry Oh cool, glad it's not abandoned. So one can just take an element from the diagram and try to make a PR out of it?

@danpetry
Contributor

@OYTIS some PRs are already being prepared, and I'm not sure what the state is of all work packages. I recommend introducing yourself on the issue and coordinating with the others - and maybe proposing something that you can help with?

@basilfx
Member

basilfx commented Aug 3, 2018

For reference, I played with this last night to get it working on an EFM32 (SLSTK3402a and SLSTK3401a). It did not work out of the box, but here are the changes I made. I disabled signatures since they don't work, and I was interested in the general principle, because I need it.

Some general things I noticed:

  • I had to disable most parts of cpu_init() while booting, because doing it twice (bootloader, then application) stalled the CPU.
  • I've noticed that JLink is sometimes too slow to flash two images after each other. If the CPU is stuck, it cannot unlock the CPU (bootloader fails, but the application works, or the other way around). JLink does support loading multiple images in the same execution.
  • Flashing the bootloader when running riotboot/flash-slot1 is IMHO not needed. It speeds up the process, and the bootloader doesn't change (that often).

@kYc0o
Contributor

kYc0o commented Aug 4, 2018

Hi @basilfx ! Thanks for testing it!

In general, the discussion I think needs to take place now in #9342, since I think this PR got a bit outdated. Otherwise, I have some comments:

I had to disable most parts of cpu_init() while booting, because doing it twice (bootloader, then application) stalled the CPU.

This is actually not very good, and means that the clocking initialisation should be verified. I had similar issues at the very beginning of the implementation of the bootloader, which showed me that clocking must be carefully configured to allow re-initialisations. It's not desirable to "#ifdef" parts in cpu_init(). Thus, if you have time, it would be nice if you could find exactly what causes trouble at the initialisation point.

I've noticed that JLink is sometimes too slow to flash two images after each other. If the CPU is stuck, it cannot unlock the CPU (bootloader fails, but the application works, or the other way around). JLink does support loading multiple images in the same execution.

@cladmi is working on that and #9514 and #8838 should help to get several files which are then passed to the flasher. We should then adapt JLink and openocd to flash them at the right place on the ROM. #9351 is also necessary to ease the task.

Flashing the bootloader when running riotboot/flash-slot1 is IMHO not needed. It speeds up the process, and the bootloader doesn't change (that often).

I agree with you on this, and I think it will be done in this way to only replace the needed slot.

@cladmi
Contributor

cladmi commented Aug 6, 2018

I had to disable most parts of cpu_init() while booting, because doing it twice (bootloader, then application) stalled the CPU.

This is actually not very good, and means that the clocking initialisation should be verified. I had similar issues at the very beginning of the implementation of the bootloader, which showed me that clocking must be carefully configured to allow re-initialisations. It's not desirable to "#ifdef" parts in cpu_init(). Thus, if you have time, it would be nice if you could find exactly what causes trouble at the initialisation point.

Thanks for your feedback. When integrating it, to make it testable, I would maybe represent it as a cpu_init_idempotent feature. With a dedicated test where cpu_init is called multiple times and should keep working properly.
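
As a rough idea of what such a test could assert, the sketch below models clock state with a plain struct instead of hardware registers. Everything here is hypothetical (the struct, the function name, the divider values); it only demonstrates the property that calling the init routine again, or calling it from a "dirty" state left by a bootloader, yields the same final configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mock clock state standing in for real hardware registers (hypothetical). */
typedef struct {
    bool pll_enabled;
    uint32_t divider;
} sketch_clock_t;

/*
 * Re-entrant init: first return to a known baseline, then apply the
 * target configuration. The result therefore does not depend on the
 * state left behind by an earlier caller.
 */
static void sketch_cpu_init(sketch_clock_t *clk)
{
    /* baseline: default clock source, no division */
    clk->pll_enabled = false;
    clk->divider = 1;

    /* illustrative target configuration */
    clk->divider = 4;
    clk->pll_enabled = true;
}

/* Helper for tests: compare two mock clock states field by field. */
static bool sketch_clock_equal(const sketch_clock_t *a, const sketch_clock_t *b)
{
    return a->pll_enabled == b->pll_enabled && a->divider == b->divider;
}
```

On real hardware the equivalent test would run the actual cpu_init several times in a row and then verify that the board still works, for instance that timers and UART output behave as expected.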

I've noticed that JLink is sometimes too slow to flash two images after each other. If the CPU is stuck, it cannot unlock the CPU (bootloader fails, but the application works, or the other way around). JLink does support loading multiple images in the same execution.

So for JLink you would need to do a "flash a big image with bootloader, metadata, firmware" instead of flashing 2 times?

@basilfx
Member

basilfx commented Aug 6, 2018

When integrating it, to make it testable, I would maybe represent it as a cpu_init_idempotent feature. With a dedicated test where cpu_init is called multiple times and should keep working properly.

So you mean that there is cpu_init and a cpu_init_idempotent? I think that the actual cpu_init should be deferred until the real application boots. For instance, EFM32 initializes the DC-DC converter, power modes, and so on. That doesn't make sense in the bootloader. Also, leaving these parts out reduces the bootloader size.

So for JLink you would need to do a "flash a big image with bootloader, metadata, firmware" instead of flashing 2 times?

No, JLink doesn't need one big image file. It can, however, flash multiple images in one 'session'. Currently, multiple 'sessions' are needed, and this sometimes fails, because JLink isn't ready for the next 'session'.

@cladmi
Contributor

cladmi commented Aug 7, 2018

So you mean that there is cpu_init and a cpu_init_idempotent? I think that the actual cpu_init should be deferred until the real application boots. For instance, EFM32 initializes the DC-DC converter, power modes, and so on. That doesn't make sense in the bootloader. Also, leaving these parts out reduces the bootloader size.

What I meant was that a CPU defines FEATURE_PROVIDED_my_cpu_init_can_be_called_again and that we make it a required feature for OTA support. Some bootloaders may require certain CPU features to be activated, and that is what I call cpu_init.

Whether it could be split between cpu_init and cpu_init_advanced_functionalities is, for me, an optimization that I do not care about in the build system ;)

No, JLink doesn't need one big image file. It can, however, flash multiple images in one 'session'. Currently, multiple 'sessions' are needed, and this sometimes fails, because JLink isn't ready for the next 'session'.

That would maybe mean a specific target for JLink as the build system and flash scripts are made for flashing one single file as there is only one FLASHFILE variable (not even merged right now). If it is binary files, it also requires an address so an address per file.

If there is a way to know when JLink is ready again, we could add an option to the flasher that waits until it puts the board in a stable state.

I would also love this for samr21-xpro, which often fails when you run flash test because of a Device or resource busy error.

@danpetry mentioned this pull request on Aug 7, 2018
@cladmi
Contributor

cladmi commented Aug 8, 2018

After having slept on it, and having re-checked your code, I would say what you did is OK for cpu_init.

Correct me if I am wrong @kYc0o: what we did not want is that the bootloader inits the CPU and then, in the RIOT firmware, part of cpu_init must be disabled because it cannot be called twice.

However, having a use_minimal_cpu_init configuration in the bootloader, while the firmware runs the normal cpu_init, would be OK. (As opposed to what I said about cpu_init normal and advanced, here it would be minimal and normal, and I like that more.)

I am just not sure for the reboot case if there could not also be a problem ? but @kYc0o would know better.

@cladmi
Contributor

cladmi commented Aug 8, 2018

TL;DR: for me it's OK to constrain the bootloader, if there are no issues with rebooting to do an update.

@kYc0o
Contributor

kYc0o commented Aug 8, 2018

After having slept on it, and having re-checked your code, I would say what you did is OK for cpu_init.

Correct me if I am wrong @kYc0o: what we did not want is that the bootloader inits the CPU and then, in the RIOT firmware, part of cpu_init must be disabled because it cannot be called twice.

Well, this is the issue. Why is it not possible to re-init the clock as many times as we want? And why is it possible for Atmel, ST and Nordic Cortex-Ms? The goal is to have a system which is not affected by its "previous" state when executing jumps to the first member of the vector table (reset_handler in most cases). My short answer here is that the clock is not being initialised properly and the code needs rework. That was the case on Atmel and Cortex CPUs. I also faced this problem on the Kinetis, where I tried exactly the same as @basilfx; that code is not merged yet, but I think it is no longer necessary since several bugs were recently fixed in the clock initialisation for such platforms.

However, having a use_minimal_cpu_init configuration in the bootloader, while the firmware runs the normal cpu_init, would be OK. (As opposed to what I said about cpu_init normal and advanced, here it would be minimal and normal, and I like that more.)

I think this would bring much more complexity and I'm convinced there should be other ways to do it. The case of EFM might be a bit special since it uses a lot of code from the vendor, which might not be optimal. For the rest of the CPUs we provide most of the code, especially the clock_init, which is the problem here. So maybe we need to make an exception here, but I'd prefer not to make a painful distinction between bootloader and applications; both IMHO should safely initialise the CPU.

I am just not sure for the reboot case if there could not also be a problem ? but @kYc0o would know better.

Normal reboot is not affected, it makes a full reset of registers and clocks.

@basilfx
Member

basilfx commented Aug 8, 2018

I think this would bring much more complexity and I'm convinced there should be other ways to do it. The case of EFM might be a bit special since it uses a lot of code from the vendor, which might not be optimal.

I definitely think it is fixable. I didn't spend more than a few minutes before discovering that the workaround of disabling it worked :-)

That said, I do think a minimal case would be desirable: why spend a lot of time (and energy) in the bootloader to set up the CPU clocks, when you want to boot a slot as fast as possible? When the EFM32 boots, it is in a mode that works. I would consider it an optimization.

@miri64 added the label State: waiting for other PR and removed the label CI: ready for build on Oct 5, 2018
@emmanuelsearch
Member

@kaspar030 closing this in favor of #11818
(feel free to reopen if deemed necessary!)
