Kami commented Aug 17, 2018

This pull request adds some additional metrics instrumentation to various services.

Rules Engine

  1. Track how many rules (trigger instances) the rules engine processes and how long each one takes to process (counter + timer).
  2. Track how long the rules engine takes to process each unique trigger instance (timer only; a counter makes no sense because each TriggerInstance is unique and is only ever processed once).
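
The counter + timer pattern from the list above can be sketched as a small context manager. This is a minimal, hypothetical illustration and not necessarily how this PR implements it: the in-memory `COUNTERS` / `TIMINGS` stores and the `process_trigger_instance` helper are assumptions made for the example; only the `CounterWithTimer(key="st2.rule.processed")` usage appears in the diff.

```python
import time
from collections import defaultdict

# Hypothetical in-memory stores standing in for a real metrics backend.
COUNTERS = defaultdict(int)
TIMINGS = defaultdict(list)


class CounterWithTimer(object):
    """Increment a counter on entry and record elapsed seconds on exit."""

    def __init__(self, key):
        self.key = key
        self._start = None

    def __enter__(self):
        COUNTERS[self.key] += 1
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        TIMINGS[self.key].append(time.perf_counter() - self._start)
        # Returning False lets exceptions from the wrapped block propagate.
        return False


def process_trigger_instance(handler, trigger_instance):
    """Hypothetical helper: wrap one handler call in the counter + timer."""
    with CounterWithTimer(key="st2.rule.processed"):
        return handler(trigger_instance)
```

Because the timing is recorded in `__exit__`, it is captured even when the wrapped handler raises.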

st2api, st2auth, st2stream

  • Number of requests / second + request processing timing info
  • Counters for various request and response related info:
    • request method
    • request path
    • response status code
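
As a rough illustration of the request instrumentation listed above, the sketch below wraps a WSGI application and counts requests by method, path and response status code. It assumes the services can be wrapped as plain WSGI apps; the class name, the in-memory stores and the key layout (`st2.<service>.request...`) are all hypothetical and not taken from this PR.

```python
import time
from collections import Counter, defaultdict

# Hypothetical in-memory stores standing in for a real metrics driver.
REQUEST_COUNTERS = Counter()
REQUEST_TIMINGS = defaultdict(list)


class RequestMetricsMiddleware(object):
    """WSGI middleware counting requests by method, path and status code."""

    def __init__(self, app, service):
        self.app = app
        self.service = service  # e.g. "api", "auth" or "stream"

    def _key(self, suffix):
        # The key layout is an assumption made for this sketch.
        return "st2.%s.%s" % (self.service, suffix)

    def __call__(self, environ, start_response):
        method = environ.get("REQUEST_METHOD", "UNKNOWN")
        path = environ.get("PATH_INFO", "/")

        REQUEST_COUNTERS[self._key("request.total")] += 1
        REQUEST_COUNTERS[self._key("request.method.%s" % method)] += 1
        REQUEST_COUNTERS[self._key("request.path.%s" % path)] += 1

        def counting_start_response(status, headers, exc_info=None):
            # WSGI status strings look like "200 OK"; keep only the code.
            status_code = status.split(" ", 1)[0]
            REQUEST_COUNTERS[self._key("response.status.%s" % status_code)] += 1
            return start_response(status, headers, exc_info)

        start = time.perf_counter()
        try:
            return self.app(environ, counting_start_response)
        finally:
            # Measures the app call only, not iteration of the response body.
            elapsed = time.perf_counter() - start
            REQUEST_TIMINGS[self._key("request")].append(elapsed)
```

A real backend would probably need the path normalized before it is used as part of a metric key, since raw paths contain slashes and can have high cardinality.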

Other Changes

  • EchoMetricsDriver, which prints all metrics information to the console to make debugging easier
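
To give a feel for what an echo-style driver does, here is a minimal sketch which writes every metric operation to a stream instead of sending it to a backend. The class name, method names and output format are assumptions made for this example and may differ from the actual EchoMetricsDriver.

```python
import sys


class EchoDriverSketch(object):
    """Hypothetical metrics driver which writes every operation to a stream."""

    def __init__(self, stream=None):
        # Default to stdout so metrics show up directly in the console.
        self.stream = stream if stream is not None else sys.stdout

    def _echo(self, operation, key, value):
        line = "[metrics] %s(key=%r, value=%r)\n" % (operation, key, value)
        self.stream.write(line)

    def time(self, key, seconds):
        self._echo("time", key, seconds)

    def inc_counter(self, key, amount=1):
        self._echo("inc_counter", key, amount)

    def set_gauge(self, key, value):
        self._echo("set_gauge", key, value)
```

Accepting the stream as a parameter keeps the sketch easy to test, for example by passing an `io.StringIO` instance.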

TODO

  • We need documentation on the exposed metrics, with an explanation of what each one represents

Kami added 2 commits August 17, 2018 12:54
Now we also track:

1. Number of rules processed by rules engine (counter + timer)
2. How long it took to process rules for each unique trigger instance.
Kami requested a review from bigmstone August 17, 2018 11:16
trigger_instance, trigger_constants.TRIGGER_INSTANCE_PROCESSING)
self.rules_engine.handle_trigger_instance(trigger_instance)

with CounterWithTimer(key="st2.rule.processed"):
Member Author

@bigmstone I'm open to a better name, something which would also make it consistent with action runner metric names.

I couldn't come up with anything better :/

Member Author

Perhaps we could also just call it st2.trigger_instance.processed, but that could be a bit misleading, imo, because trigger instances can also come in through the API and can be handled by other services.

At some point we might also care about total trigger instances (also the ones which are / will be processed elsewhere), but we definitely care specifically about trigger instances processed by the rules engine so the metrics key should convey that.

Contributor

We should probably create a function to generate these names in a standard way. It could take a few input params so the naming isn't so subjective, but like you, I don't have strong convictions here. I think it mainly just matters that it's consistent.
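
One possible shape for such a naming function is sketched below. The function name, its parameters (`resource`, `event`, `service`, `prefix`) and the normalization rules are all hypothetical; the only thing taken from this thread is that the result should look like `st2.rule.processed`.

```python
import re


def _normalize(part):
    """Lowercase a key part and replace unsupported characters with "_"."""
    cleaned = re.sub(r"[^a-z0-9_]+", "_", part.strip().lower()).strip("_")
    if not cleaned:
        raise ValueError("metric key part %r is empty after normalization" % part)
    return cleaned


def build_metric_key(resource, event, service=None, prefix="st2"):
    """Build a dotted metric key from a fixed set of inputs.

    The optional ``service`` part makes it explicit which service emitted
    the metric (hypothetical convention, not defined in this PR).
    """
    parts = [prefix]
    if service:
        parts.append(service)
    parts.extend([resource, event])
    return ".".join(_normalize(part) for part in parts)
```

With this sketch, `build_metric_key("rule", "processed")` yields `st2.rule.processed`, and passing a `service` adds one extra key segment after the prefix.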

Kami commented Aug 17, 2018

@bigmstone While working on that, I noticed we are missing a "Gauge" metric type (so we can also measure the total number of requests and similar values, not just requests / second).

I will add that in this PR.

Kami added 5 commits August 17, 2018 13:33
NOTE: The Prometheus driver conflates counter and gauge a bit at the moment. We
should eventually sort this out so that using different drivers won't result in
different metric types and a different representation and
visualization.
Kami added this to the 2.9.0 milestone Aug 17, 2018
bigmstone left a comment

LGTM

dzimine commented Aug 20, 2018

Maybe:

  • do we already have the number of actions / sec - scheduled, executed, action queue size, time per action?
  • on rules - it's interesting to add color to rule processing:
    • trigger instances, rule evaluations (per trigger), rule invocations (per trigger)

Kami commented Aug 21, 2018

@dzimine Thanks for the feedback.

Those are good metrics, and most of them are already tracked by this PR (docs at https://github.com/StackStorm/st2docs/pull/787/files?short_path=fa1dde0#diff-fa1dde031ec91f548c4d4c3a722ad980).

do we already have number of actions / sec - scheduled, executed, action queue size, time per action?

We have st2.action.executions and st2.action.executions.<execution status>. There is no dedicated queue size metric, but the queue size can be inferred from the execution status metrics (running for executions which are running; delayed and requested for executions which are waiting to be executed, i.e. the queue).
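
As a sketch of that inference: given current values of the per-status metrics, the queue size is the sum of the waiting statuses. The `status_values` mapping is a hypothetical stand-in for whatever the monitoring backend returns; in practice this sum would more likely be written as a query in the visualization tool.

```python
# Statuses named above as "waiting to be executed".
WAITING_STATUSES = ("requested", "delayed")


def infer_queue_size(status_values):
    """Sum the per-status execution metrics which represent queued work.

    ``status_values`` is a hypothetical mapping of metric key to its current
    value, e.g. as read back from the monitoring backend.
    """
    return sum(
        status_values.get("st2.action.executions.%s" % status, 0)
        for status in WAITING_STATUSES
    )
```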

on rules - it's interesting to add color to rule processing: trigger instances, rule evaluations (per trigger), rule invocations (per trigger)

We also have that now, but some of those metrics are scoped only to a rule reference and not to a trigger instance. I will make sure we also have all of that data on a per trigger instance basis (because yes, that's important: the trigger instance + rule combination could give us a clue about what is going on, e.g. whether a trigger instance payload in combination with some rule criteria is slow).

On a related note, I talked about this with @bigmstone on Slack yesterday. There are also a lot of "derived" metrics we could add to StackStorm but, imo, that would add a lot of overhead, and it's not necessary when those metrics can be derived from other existing metrics inside the monitoring / visualization tool (e.g. queue size from the execution status metrics).

Kami added 21 commits August 21, 2018 11:36
statsd counters are of a special type which is aggregated, sampled and
calculated into a rate, so decreasing them will result in invalid /
unexpected values.

Decreasing them would only make sense if statsd didn't do any
processing on them and treated them as raw values (e.g. gauges).
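
The difference is visible in the statsd line format itself: counters are sent with a `|c` suffix and gauges with `|g`, and a gauge value with an explicit sign adjusts the stored value instead of replacing it. The helper functions below are a hypothetical sketch which only formats the payloads and does not send anything; the gauge key used in the usage note is made up for illustration.

```python
def format_counter(key, delta=1):
    """Counter sample: statsd aggregates these and reports a rate per flush."""
    return "%s:%d|c" % (key, delta)


def format_gauge(key, value):
    """Gauge sample: statsd keeps the last value it received.

    A leading sign would be read as a relative change, so negative absolute
    values are rejected here.
    """
    if value < 0:
        raise ValueError("use format_gauge_delta() for relative changes")
    return "%s:%d|g" % (key, value)


def format_gauge_delta(key, delta):
    """Signed gauge sample: statsd adjusts the stored value by ``delta``."""
    return "%s:%+d|g" % (key, delta)
```

For a value which needs to go both up and down, such as the number of running executions, `format_gauge_delta("st2.action.executions.running", -1)` produces `st2.action.executions.running:-1|g`, which lowers the gauge by one without skewing any rate calculation.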
groups metrics based on the type so the suffix is redundant.
This option specifies an optional prefix which is prepended to each
metric key / name.

This comes in handy when you want to use the same statsd or other backend
instance for multiple environments (each environment would specify a
different prefix).
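
A minimal sketch of the idea, assuming the prefix is simply joined to the key with a dot. The function name and the exact position of the prefix within the final key are assumptions; the commit message only says that the prefix is prepended.

```python
def apply_prefix(key, prefix=None):
    """Return ``key`` with an optional environment prefix prepended.

    Hypothetical helper: the dot separator is an assumption of this sketch.
    """
    if not prefix:
        return key
    return "%s.%s" % (prefix.rstrip("."), key)
```

With a prefix of `staging`, the key `st2.rule.processed` becomes `staging.st2.rule.processed`, so two environments reporting to the same backend never write to the same key.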
Kami merged commit 289d9e3 into master Aug 22, 2018
Kami deleted the rules_engine_metrics_instrumentation branch August 22, 2018 15:42

4 participants