@eboasson
Contributor

This is a terribly large PR that does not at all follow the general rule that one should not mix unrelated things in a single PR. It is rather bad to set such a poor example, but I have nonetheless decided to do it this way:

  • Each commit (potentially) builds on top of the preceding one, so splitting it up wouldn’t really make life simpler, considering that it all ends up in my lap anyway.
  • A good number of the small commits in here were the result of a discovery at some point, then inserted at the appropriate point in the history to keep each commit in a working state (verified on Fedora 32 with the address sanitizer enabled).
  • A large part of it comes from two other PRs that I would have merged into master had I not held back for other reasons. They've been sitting there for long enough now, and with the 0.6.0 release process under way, ROS2 Dashing not yet having moved to the releases/0.5.x branch, and the preparations for merging security support into master, it really makes more sense to merge these into the security branch.

I welcome anyone commenting on it. I'd recommend looking at individual commits ... In particular I'd be much obliged if @MarcelJordense could find some time to look at d2ca446, 9579461, cacd39a and 50fa943, and if @dennis-adlink would find time to look at b0a19ef and 816d016, because I think they know that code well enough to do meaningful review without taking an inordinate amount of time.

What follows is a summary:

Part the First

In which #499 and #505 are moved over to the security branch. They are very nearly identical to the originals: a few types have been given better names, a few unsigneds got changed to uint32_ts, and a complete mess-up in one of the intermediate commits of #499 (such that it wouldn’t even compile) has been fixed. For those who like to read diffs of diffs, comparing them is easy enough ...

In short, from #499:

  • improve the tracing of sample content by distinguishing between unsigned and signed integers and floating-point numbers
  • bring sanity to the implementation of dds_read_instance (and similar)
  • add dds_readcdr (this affects the RHC plugin interface)
  • refactor & reduce code duplication in the implementation of reading/taking
  • strip out some wrapper functions that served no purpose in the RHC implementation

and #505:

  • saner behaviour of auto-dispose when there are multiple active writers: do the auto-dispose when the instance would otherwise have gone to the NOT_ALIVE_NO_WRITERS state
  • make dds_lookup_instance on a built-in topic work instead of crashing because of a missing implementation of the from_sample operation …
  • various bits of refactoring

(This part ends at "Rework listener tests" 24ccb39)

Part the Second

In which the security branch came in for a looking-over. It was there that a few mouldy spots were discovered. A number of them could have developed into real problems, so it is a good thing that this happened. That is:

d2ca4467 Vet lengths of property lists in crypto deserialization
b0a19ef2 Fix double free if local identity validation fails on certificate expiry
816d0160 Memory leak reading access control configuration
95794611 Atomic update next heartbeat time for p2p writers
cacd39a7 Fix leak when inserting p2p message in WHC fails
50fa943b Fix conversion of user_data to security plugin
9e827098 Properly pair entity_pin/mutex_lock

Part the Third

In which the reworked listener tests were subjected to a bit of further rework: it makes vastly more sense to have an init/step/fini set of functions so that this little monster can be paired with custom code. No doubt there is more work to be done (e.g., turning the individual operations into nice functions so they can be used in more contexts; some of them might come in useful).

This then led to a few more things:

  • The instance state is now set to ALIVE when a sample is rejected (too old, doesn’t pass the filter), which means at least transient-local data will always go back to ALIVE after a reconnect.
  • It turns out that after a disconnect, rediscovery is blocked for 10s. For no good reason that I can think of. (Being able to block it for a while can be useful, but not for doing so by default.)
  • The dds_merge_listeners function changed to copying each listener from the source to the destination if and only if the destination is clear (for that listener) and the source is set (for that listener). Before, it would copy it if the destination was clear, regardless of whether the source was set. The subtle difference lies in the listener arguments, and I think this makes more sense. (Need to add set/get stuff that includes the argument …)
  • A macro DDS_STATUS_ID_MAX got added to the interface that makes life easier for anyone wanting to range over all status IDs.
  • The implementation of dds_wait_for_acks is now capable of waiting for acknowledgements from a specific reader, though this is not currently exposed in the interface. It’s worth considering, but there needs to be a compelling reason to do so, I’d say.
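The new dds_merge_listeners rule described above can be sketched as follows. This is a toy with a single callback slot and made-up names (`toy_listener_t`, `toy_merge`); the real dds_listener_t is opaque and carries many callbacks:

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical listener with one callback slot and its argument. */
typedef struct {
  void (*on_data_available) (void *arg);
  void *arg;
} toy_listener_t;

static void toy_cb (void *arg) { (void) arg; }

/* New rule: copy a callback (and its argument) from src to dst only if
   dst has no callback set for it AND src does.  Previously the copy
   happened whenever dst was clear, even when src was clear too, which
   would clobber dst's argument with an irrelevant value. */
static void toy_merge (toy_listener_t *dst, const toy_listener_t *src)
{
  if (dst->on_data_available == NULL && src->on_data_available != NULL) {
    dst->on_data_available = src->on_data_available;
    dst->arg = src->arg;
  }
}
```

With this rule, merging a fully-clear source into a destination leaves the destination's argument untouched, which is where the subtle difference shows up.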

Part the Fourth

In which Clang’s static analyzer in its latest incarnation picked up a few dead stores (all eliminated) and GCC 10 complained about a few things more. This does not yet eliminate all warnings from GCC 10.

@eboasson
Copy link
Contributor Author

Thanks @MarcelJordense

eboasson added 27 commits May 15, 2020 15:35
* Add a flag to indicate signed integral values and one to indicate
  floating-point values
* Set these flags in the output of idlc
* Use them when printing sample contents to the trace

By encoding the information as flags in reserved bits the actual
serialization and deserialization is unaffected.

Signed-off-by: Erik Boasson <[email protected]>
Scanning all instances was never good for anything: the RHC is organised
as hash table on instance id (which is an alias for "instance handle")
and it was always designed to do this with a fast lookup.

Signed-off-by: Erik Boasson <[email protected]>
Do not pass a dangling pointer to update_conditions_locked after
dropping an instance.  The dangling pointer did not actually get
dereferenced because of the state changes caused by dropping the
samples, but that is cutting it a bit fine.

Signed-off-by: Erik Boasson <[email protected]>
Deadline registration, renewal and deregistration was somewhat spread
through the code and relied on the "isdisposed" flag as a proxy for
whether it was registered or not.  This consolidates the deadline
handling code in a final step of updating the instance and uses a
separate flag to track whether the instance is currently registered in
the deadline administration or not.

This also makes it possible to trivially change the rules for when
deadline notifications are required, and so allows for, e.g., adding a
mode in which instances in the "no writers" state do not trigger any
deadline missed notifications, or just once (both of which seem useful
modes).

Signed-off-by: Erik Boasson <[email protected]>
This changes the behaviour of auto-dispose writers: instead of always
disposing when the writer disposes the data, it now only disposes the
data when the instance would otherwise go to the "no writers" state.
This only affects the behaviour when there are multiple writers for the
same instance.

In case the writers use a different value for the auto-dispose setting,
it now tracks whether an instance has ever been touched by a writer
with auto-dispose enabled, and auto-disposes the instance when the
last writer leaves if this is the case.  This way, if an instance is
registered by one auto-dispose and one non-auto-dispose writer, the
order of unregistering does not matter.

Signed-off-by: Erik Boasson <[email protected]>
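The order-independence described above can be sketched with a toy instance state (names like `toy_instance_t` are illustrative, not the actual Cyclone DDS code): the instance remembers whether any auto-dispose writer ever touched it, and the decision is taken only when the writer count drops to zero.

```c
#include <stdbool.h>
#include <assert.h>

/* Hypothetical instance state: live writer count plus a sticky flag
   recording whether an auto-dispose writer has ever written it. */
typedef struct {
  unsigned wr_count;
  bool autodispose_seen;
} toy_instance_t;

static void toy_register (toy_instance_t *inst, bool wr_autodispose)
{
  inst->wr_count++;
  if (wr_autodispose)
    inst->autodispose_seen = true;
}

/* Returns true iff unregistering this writer must dispose the instance:
   only when no writers remain and at least one auto-dispose writer ever
   touched it.  Because the flag is sticky, the order in which the
   auto-dispose and non-auto-dispose writers unregister does not matter. */
static bool toy_unregister (toy_instance_t *inst)
{
  assert (inst->wr_count > 0);
  inst->wr_count--;
  return (inst->wr_count == 0 && inst->autodispose_seen);
}
```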
The entire point of this test program is to exercise the RHC while
checking its internal state.  The likelihood of (at least some)
forgetting to enable the "expensive" checks has been proven to be
significant.

Signed-off-by: Erik Boasson <[email protected]>
Use of an auto-dispose writer meant the NO_WRITERS case did not actually
get tested.  The behaviour of the implementation was to generate
deadline missed notifications for such instances, but the test expected
otherwise.

There is a disagreement between different DDS implementations on the
desirability of generating deadline missed notifications for NOT_ALIVE
instances.  Deadline notifications on DISPOSED instances seems silly, as
it means end-of-life.  Deadline notifications on a NO_WRITERS instance
are certainly valuable for applications that don't pay attention to the
number of writers (otherwise one has to monitor both liveliness changed
and deadline missed notifications to be sure to get some
notification).

Different usage patterns definitely affect what is desirable and I doubt
one-size-fits-all is the right approach.  This commit changes the test
and retains the behaviour, and if it errs, it at least errs on the side
of caution.

Signed-off-by: Erik Boasson <[email protected]>
Disposing an instance would only add an invalid sample if the instance
was empty, but it should also do so when the latest sample has been
read.  Otherwise reading all NOT_READ samples gives nothing or
nonsensical output.

Signed-off-by: Erik Boasson <[email protected]>
The standard defines GUIDs as an array of 16 uint8_t, and that is how
they are presented on the network and in built-in topic samples.
Internally they are arrays of 4 uint32_t, requiring byte-order
conversion.

A keyhash is also an array of 16 uint8_t, and used almost exclusively on
the network.  The exception is the generation of built-in topic samples,
which relies on the "from_keyhash" function.  One would expect the
keyhash here to contain the GUID in the external representation, and
this commit adds the byte-order conversions to conform to the
expectation.

Signed-off-by: Erik Boasson <[email protected]>
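The conversion described above can be sketched like this. The names (`toy_guid_t`, `toy_guid_from_keyhash`) are illustrative; the real code lives in the built-in topic support, but the byte-order logic is the same idea: the 16-byte external form is big-endian, the internal form is four native uint32_t.

```c
#include <stdint.h>
#include <assert.h>

/* Internal GUID: four native-endianness uint32_t.  External form
   (network, keyhash): 16 bytes in big-endian order. */
typedef struct { uint32_t u[4]; } toy_guid_t;

static uint32_t toy_from_be32 (const uint8_t *b)
{
  return ((uint32_t) b[0] << 24) | ((uint32_t) b[1] << 16) |
         ((uint32_t) b[2] << 8) | (uint32_t) b[3];
}

/* Convert a keyhash containing a GUID in the external representation
   into the internal one, performing the byte-order conversion the
   commit adds on the "from_keyhash" path. */
static toy_guid_t toy_guid_from_keyhash (const uint8_t kh[16])
{
  toy_guid_t g;
  for (int i = 0; i < 4; i++)
    g.u[i] = toy_from_be32 (kh + 4 * i);
  return g;
}
```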
One cannot create writers for built-in topics, therefore one generally
does not create samples for them.  However, one can look up an instance
handle from a sample with just a key value in it, and so the function is
needed.

Signed-off-by: Erik Boasson <[email protected]>
A few failures to signal DATA_AVAILABLE (as well as some where it was
signalled unnecessarily) were discovered while refactoring the RHC
despite the tests all passing.  Clearly the tests were inadequate.

The enormous amount of boilerplate in the tests prompted a small rewrite
to a programmable listener invocation tester that one simply feeds a
noise-like one-liner in a string.  This trades the boilerplate for
somewhat inscrutable code.

Signed-off-by: Erik Boasson <[email protected]>
Coverity has difficulty observing that dds_entity_pin /
ddsrt_mutex_lock / dds_entity_unlock is correct.  It is perhaps a bit
confusing, so change it.

Signed-off-by: Erik Boasson <[email protected]>
The security plugins currently use the standardized representations of
octet sequences, unlike the DDSI stack's internal representation.  A
shallow copy is therefore not simply a memcpy.

Signed-off-by: Erik Boasson <[email protected]>
Sending a heartbeat to all matched readers for the P2P builtin
participant volatile secure writer unlocks the writer before pushing
each individual message out, and so determining the time of the next
heartbeat event before writing and updating it afterwards means the
state may have changed.  While this appears benign, it is better to
do the update atomically.

Signed-off-by: Erik Boasson <[email protected]>
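One way to do such an update atomically, sketched with C11 atomics and made-up names (`toy_tsched`, `toy_resched_if_unchanged`; the actual Cyclone DDS code uses its own ddsrt atomics): a compare-and-swap applies the new heartbeat time only if nobody else moved it while the lock was released.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical scheduled time of the next heartbeat event, in ns. */
static _Atomic int64_t toy_tsched;

/* Apply a new event time only if the value is still the one observed
   before the writer lock was dropped; if another thread rescheduled
   the event meanwhile, the stale update is simply discarded. */
static bool toy_resched_if_unchanged (int64_t observed, int64_t tnew)
{
  return atomic_compare_exchange_strong (&toy_tsched, &observed, tnew);
}
```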
The memory allocation in deserializing property lists within the crypto
code should not trust the deserialized length and try to allocate that
much memory but should first verify that the length is consistent with
the number of bytes remaining in the input.  (Noted by Coverity as use
of tainted data.)

Signed-off-by: Erik Boasson <[email protected]>
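The shape of such a check, as a hedged sketch (the function name and signature are made up, not the crypto code itself): before allocating `claimed_len` elements, verify that many elements can even fit in the bytes remaining in the input, taking care that the multiplication cannot overflow.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <assert.h>

/* Returns true iff a sequence of claimed_len elements of elem_size
   bytes can be present in the remaining input bytes.  Dividing instead
   of multiplying avoids overflow for hostile claimed_len values, so a
   tainted length can never drive a huge allocation. */
static bool toy_vet_seq_length (uint32_t claimed_len, size_t elem_size,
                                size_t remaining)
{
  if (elem_size != 0 && claimed_len > remaining / elem_size)
    return false;
  return true;
}
```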
This leaves the argument pointer in the destination unchanged, rather
than resetting it to an irrelevant value.

Signed-off-by: Erik Boasson <[email protected]>
The code for executing one-liner tests might be more generally useful.

Signed-off-by: Erik Boasson <[email protected]>
In particular, this means instances published by a transient-local
writer will go back to ALIVE following a disconnect and reconnect.

Signed-off-by: Erik Boasson <[email protected]>
eboasson added 6 commits May 15, 2020 15:35
Signed-off-by: Erik Boasson <[email protected]>
The dds_wait_for_acks function follows the DCPS specification and allows
waiting for all matching readers to have acknowledged all data written
prior to that point.  This commit leaves the API unchanged but extends
the implementation to make it possible to wait until a specific reader
has acknowledged everything, as this is a useful device in testing with
deliberate one-way disconnections using dds_domain_set_deafmute.

Signed-off-by: Erik Boasson <[email protected]>
When defining a new topic, the serializer instructions, which are
generated by the IDL compiler and typically reside in constant memory,
are copied into memory managed by the Cyclone implementation.  For
this it needs to compute the size of the serializer program, which the
IDL compiler doesn't provide.  It does this by effectively dry-running
the program.  (Note that it doesn't validate the program.)

All but the JSR operations move the program counter forward, but the JSR
operation can cause it to go backward instead and allows implementing
recursive types (the IDL compiler doesn't support them, but one might
decide to work around that limitation).  When dry-running the program,
following a backwards jump can cause a non-terminating loop.

The jump could potentially be to an unexplored address and so ignoring
all backwards jumps potentially means it skips part of the program.  As
this is not a validator and the program can always be arranged so that
following a backwards jump is not relevant to computing the size
correctly, this is a reasonable approximation.

Signed-off-by: Erik Boasson <[email protected]>
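The termination argument above can be illustrated with a toy instruction stream (this is a sketch, not the real serializer VM; opcodes and the offset convention are made up): the dry run tracks the highest word index reached, follows JSR only when the offset is forward, and ignores backward jumps so the walk always terminates.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Toy program: one word per op; TOY_JSR carries a signed offset in the
   next word, relative to the JSR opcode itself; TOY_RET terminates. */
enum { TOY_OP = 0, TOY_JSR = 1, TOY_RET = 2 };

static void toy_walk (const int32_t *prog, size_t pc, size_t *maxidx)
{
  while (prog[pc] != TOY_RET) {
    if (prog[pc] == TOY_JSR) {
      const int32_t off = prog[pc + 1];
      if (pc + 1 > *maxidx)
        *maxidx = pc + 1;
      /* follow forward jumps only; backward jumps (used for recursive
         types) are skipped so the dry run is guaranteed to terminate */
      if (off > 0)
        toy_walk (prog, pc + (size_t) off, maxidx);
      pc += 2;
    } else {
      pc += 1;
    }
  }
  if (pc > *maxidx)
    *maxidx = pc;
}

/* Size in words: highest index reached via the dry run, plus one. */
static size_t toy_prog_size (const int32_t *prog)
{
  size_t maxidx = 0;
  toy_walk (prog, 0, &maxidx);
  return maxidx + 1;
}
```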
eboasson added 3 commits May 16, 2020 08:56
* Compute the time at which the handshake must have completed from the
  initial timeout specification, rather than using it as a timeout for
  the individual steps of the handshake
* If the handshake fails because an expected message is not present,
  print this, including whether the timeout occurred because the message
  queue was empty or because the expected message could not be found in
  a non-empty queue.
* Replace the 0.5s per-step timeout with a single 2s timeout.

Signed-off-by: Erik Boasson <[email protected]>
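The deadline computation in the first bullet can be sketched as follows (illustrative names, dds_time-style nanosecond timestamps; not the actual test code): the absolute deadline is fixed once, and each handshake step waits at most for the time remaining until it.

```c
#include <stdint.h>
#include <assert.h>

typedef int64_t toy_time_t;  /* nanoseconds, as in dds_time_t */

/* Time left until the handshake deadline; each step uses this as its
   wait bound, so the total never exceeds the initial timeout no matter
   how many steps (or spurious wakeups) occur. */
static toy_time_t toy_remaining (toy_time_t deadline, toy_time_t now)
{
  return (deadline > now) ? deadline - now : 0;
}
```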
* Increase the matching timeout to 5s (there are some hints the failures
  on Travis are timing related)
* Replace the relative timeout in the waitset by a timestamp so that it
  gives up after the specified timeout regardless of the number of
  events that occur

Signed-off-by: Erik Boasson <[email protected]>
@eboasson eboasson merged commit 2aa7054 into eclipse-cyclonedds:security May 16, 2020
@eboasson eboasson deleted the sec-coverity branch May 10, 2021 12:03