Skip to content

Add cpu-usec metric support under CLUSTER SLOT-STATS command#712

Merged
madolson merged 9 commits intovalkey-io:unstablefrom
kyle-yh-kim:20-cpu
Jul 23, 2024
Merged

Add cpu-usec metric support under CLUSTER SLOT-STATS command#712
madolson merged 9 commits intovalkey-io:unstablefrom
kyle-yh-kim:20-cpu

Conversation

@kyle-yh-kim
Copy link
Contributor

The metric tracks cpu time in micro-seconds, sharing the same value as INFO COMMANDSTATS, aggregated under per-slot context.

…io#20).

The metric tracks cpu time in micro-seconds, sharing the same
value as INFO COMMANDSTATS, aggregated under per-slot context.

Signed-off-by: Kyle Kim <[email protected]>
@kyle-yh-kim
Copy link
Contributor Author

First revision only holds implementation changes for initial feedback purposes, with pending perf testing.
Integration tests are not up-to-date, and thus failing. This will soon be followed-up.

@zvi-code
Copy link
Contributor

zvi-code commented Jul 3, 2024

@kyle-yh-kim

I totally get the value in per-slot information! But wondering about the how part.

Have you considered alternatives? For example:

  1. extend server.duration_stats[<type>] to server.duration_stats[<type>][max_slots +1] ?
  2. extend the real command to have slot dimension?
  3. keep the info on the client until the slot of the client is reset...

The reason I raise this is we have many slot array in many places in the code and this seems to be arbitrary addition to that and on the other hand we have several duration tracking mechanisms and this adds another one. I am wondering if we can merge/reuse these. In addition I can see we would want additional information like command count and maybe histogram.. this will not extend natively and we should strive to reuse what we have or enhance it, IMO.

@kyle-yh-kim
Copy link
Contributor Author

Thanks for your comments. To answer your concerns;

Have you considered alternatives? For example:

  1. extend server.duration_stats[] to server.duration_stats[][max_slots +1] ?
  2. extend the real command to have slot dimension?
  3. keep the info on the client until the slot of the client is reset...

The reason I raise this is we have many slot array in many places in the code and this seems to be arbitrary addition to that and on the other hand we have several duration tracking mechanisms and this adds another one.

For cpu-usec explicitly, I agree that this metric alone could be squeezed into other adjacent storages mentioned above, with added refactoring efforts.

However, when reviewing the per-slot metrics holistically, with the additions of per-slot 1) network-bytes-in [PR link], 2) network-bytes-out [PR-link] and 3) memory-bytes [PR-pending], this sparse disconnection between cpu-usec and the remaining per-slot metrics no longer makes sense.

Plus, we obtain cache locality benefits from keeping all metrics memory contiguous, as the CLUSTER SLOT-STATS command accesses all tracked metrics (cpu, network-bytes-in, network-bytes-out, memory-bytes) under the specified range of slots.

Given considerations of Valkey 8.0 rc1 code-cutoff timeline, refactoring efforts, remaining per-slot metrics storage cohesiveness, and cache locality benefits, my preference is to keep the slot_stats[] location as is. The only remaining change would be to follow the latest proposal from @zuiderkwast, which recommends to relocate the slot statistics array under clusterState such that it only consumes this memory in cluster mode;

typedef struct slotStat {
    uint64_t cpu_usec;
    uint64_t network_bytes_in;
    uint64_t network_bytes_out;
    uint64_t memory_bytes;
} slotStat;

struct clusterState {
    ...
    slotStat slot_stats[CLUSTER_SLOTS];
};

- Added more integration tests.
- Updated canAddCpuDuration() condition to avoid duplicate accounting
  upon exec, eval and fcall commands.

Signed-off-by: Kyle Kim <[email protected]>
@madolson
Copy link
Member

madolson commented Jul 14, 2024

The reason I raise this is we have many slot array in many places in the code and this seems to be arbitrary addition to that and on the other hand we have several duration tracking mechanisms and this adds another one. I am wondering if we can merge/reuse these. In addition I can see we would want additional information like command count and maybe histogram.. this will not extend natively and we should strive to reuse what we have or enhance it, IMO.

Command count might make, I'm not sure we really want a histogram on a per-slot basis seems to be missing the point of statistics though. The point of these metrics is for identify hot shards for migration for auto-balancing, it's not for trying to introspect too much into a specific slot.

@kyle-yh-kim kyle-yh-kim force-pushed the 20-cpu branch 2 times, most recently from 9659a40 to d3d71c1 Compare July 15, 2024 03:39
- Added a modifiable config, "cluster-slot-stats-enabled", with default value false.
- Added comments for bypassing EXEC, EVAL and FCALL calculations.
- Added integration test cases for functions.
- Fixed existing integration test cases.
- Moved slot_stats array under clusterState.

Signed-off-by: Kyle Kim <[email protected]>
- Updated the aggregation to bypass nested commands, and instead
  account for wrapper commands (EXEC, EVAL, FCALL).
- Added slot invalidation for cross-slot lua scripting.
- Added resetClusterStats().
- Added documentation to valkey.conf

Signed-off-by: Kyle Kim <[email protected]>
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, just some minor comments, please merge and we can address the comments and merge.

@madolson madolson added the release-notes This issue should get a line item in the release notes label Jul 22, 2024
@codecov
Copy link

codecov bot commented Jul 22, 2024

Codecov Report

Attention: Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.

Project coverage is 70.36%. Comparing base (14e09e9) to head (de6c75e).
Report is 1 commits behind head on unstable.

Additional details and impacted files
@@            Coverage Diff            @@
##           unstable     #712   +/-   ##
=========================================
  Coverage     70.36%   70.36%           
=========================================
  Files           112      112           
  Lines         61275    61308   +33     
=========================================
+ Hits          43115    43141   +26     
- Misses        18160    18167    +7     
Files Coverage Δ
src/blocked.c 91.48% <100.00%> (+0.03%) ⬆️
src/cluster.c 88.35% <100.00%> (+0.05%) ⬆️
src/cluster_legacy.c 85.94% <100.00%> (+0.21%) ⬆️
src/config.c 78.69% <100.00%> (+0.01%) ⬆️
src/script.c 87.50% <100.00%> (+0.09%) ⬆️
src/server.c 88.56% <100.00%> (+<0.01%) ⬆️
src/server.h 100.00% <ø> (ø)
src/cluster_slot_stats.c 90.32% <92.00%> (+1.10%) ⬆️

... and 7 files with indirect coverage changes

@kyle-yh-kim kyle-yh-kim force-pushed the 20-cpu branch 3 times, most recently from 1a1ce40 to 930fe95 Compare July 22, 2024 20:26
- Updated integration test to assert against expected slots existence.
- Updated valkey.conf

Signed-off-by: Kyle Kim <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
- Updated ORDERBY handling to raise error appropriately.
- Updated integration tests to include the above test case.

Signed-off-by: Kyle Kim <[email protected]>
order_by = CPU_USEC;
} else {
addReplyError(c, "Unrecognized sort metric for ORDER BY. The supported metrics are: key-count.");
addReplyError(c, "Unrecognized sort metric for ORDERBY.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this error, it would be nice to be more useful. We can follow up on this though.

Signed-off-by: Madelyn Olson <[email protected]>
@madolson
Copy link
Member

@madolson madolson merged commit 5000c05 into valkey-io:unstable Jul 23, 2024
hwware pushed a commit to hwware/valkey that referenced this pull request Jul 25, 2024
…alkey-io#712)

The metric tracks cpu time in micro-seconds, sharing the same value as
`INFO COMMANDSTATS`, aggregated under per-slot context.

---------

Signed-off-by: Kyle Kim <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
zuiderkwast added a commit to valkey-io/valkey-doc that referenced this pull request Nov 18, 2024
Docs for CLUSTER SLOT-STATS, with key-count, cpu-usec, network-bytes-in,
and network-bytes-out metrics.

- valkey-io/valkey#351
- valkey-io/valkey#712
- valkey-io/valkey#720

---------

Signed-off-by: Kyle Kim <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
@enjoy-binbin enjoy-binbin changed the title Add cpu-usec metric support under CLUSTER SLOT-STATS command (#20). Add cpu-usec metric support under CLUSTER SLOT-STATS command Jun 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes This issue should get a line item in the release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants