Skip to content

Conversation

@xbasel
Copy link
Collaborator

@xbasel xbasel commented Jun 30, 2025

This change adds support for active expiration of hash fields with TTLs (Hash Field Expiration), building on the existing key-level expiry system.

Field TTL metadata is tracked in volatile sets associated with each hash key. Expired fields are reclaimed incrementally by the active expiration loop, using a new job type to alternate between key expiry and field expiry within the same logic and effort budget.

Both key and field expiration now share the same scheduler infrastructure.

Alternating job types ensures fairness and avoids starvation, while keeping CPU usage predictable.

+-----------------+
|     DB          |
+-----------------+
        |
        v
+---------------------+
|     myhash          |  (key with TTL)
+---------------------+
        |
        v
+------------------------------------+
| fields (hashType)                  |
|  - field1                          |
|  - field2                          |
|  - fieldN                          |
+------------------------------------+
        |
        v
+------------------------------------+
| volatile set (field-level TTL)     |
|  - field1 expires at T1            |
|  - field5 expires at T5            |
+------------------------------------+

No new configuration was introduced; the existing active-expire-effort and time budget are reused for both key and field expiry.

Also active defrag for volatile sets is added.

@xbasel
Copy link
Collaborator Author

xbasel commented Jun 30, 2025

THIS IS NOT LONGER RELEVANT

Please note that at first glance, it seems logical to try and unify the active expiry cycle for keys and for hash fields with TTLs. After all, both processes are “expire something with a timestamp,” and both follow an incremental, time-bounded loop, and esepcially that the existing key expiry mechnaism is already optmized to not starve the engine. It feels natural to try to generalize the mechanism using function pointers, callbacks, or generic iterators to share the same control flow.

However, for me this fell apart because of a fundamental structural difference:

  • Key expiry operates on the top-level keyspace, where every db->key is an independent unit, randomly sampled, with a well-defined expiry time. Once you pick a key, you only have to check its expiry and delete it if needed. That is a single-level sampling problem.

-Field expiry is different. You do not have a flat namespace of all fields across the DB. Instead, you have a two-level hierarchy:

    sample a key from the DB containing volatile fields

    then scan inside its attached volatile set to find expired fields

    then decide if the hash itself should be deleted if all fields expire

Sampling: sampling volatile keys is meaningless since a single volatile key might hold 99% of the expirables, and treating all entries as a flat sample pool doesn’t work because volatile sets are already ordered and are scanned semi-deterministically. Sampling fundamentally behaves differently for keys vs. fields, making separate loops cleaner and correct.

Unifying expiry would require sampling across multiple levels, separate iterators for keys and volatile sets, tracking state across two iteration layers, handling mid-scan key deletions, and preventing starvation from large volatile sets.
While technically possible with a generic framework, the cost in complexity abd risk of subtle bugs, and maintenance is simply not justified IMHO.

@xbasel xbasel marked this pull request as draft June 30, 2025 15:24
ranshid pushed a commit that referenced this pull request Jul 1, 2025
…y-io#2257)

**Current state**
During `hashtableScanDefrag`, rehashing is paused to prevent entries
from moving, but the scan callback can still delete entries which
triggers `hashtableShrinkIfNeeded`. For example, the
`expireScanCallback` can delete expired entries.

**Issue**
This can cause the table to be resized and the old memory to be freed
while the scan is still accessing it, resulting in the following memory
access violation:

```
[err]: Sanitizer error: =================================================================
==46774==ERROR: AddressSanitizer: heap-use-after-free on address 0x611000003100 at pc 0x0000004704d3 bp 0x7fffcb062000 sp 0x7fffcb061ff0
READ of size 1 at 0x611000003100 thread T0
    #0 0x4704d2 in isPositionFilled /home/gusakovy/Projects/valkey/src/hashtable.c:422
    #1 0x478b45 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1768
    #2 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    #3 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    #4 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    #5 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    #6 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    #7 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    valkey-io#8 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    valkey-io#9 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    valkey-io#10 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    valkey-io#11 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)
    valkey-io#12 0x452e39 in _start (/local/home/gusakovy/Projects/valkey/src/valkey-server+0x452e39)

0x611000003100 is located 0 bytes inside of 256-byte region [0x611000003100,0x611000003200)
freed by thread T0 here:
    #0 0x7f471a34a1e5 in __interceptor_free (/lib64/libasan.so.4+0xd81e5)
    #1 0x4aefbc in zfree_internal /home/gusakovy/Projects/valkey/src/zmalloc.c:400
    #2 0x4aeff5 in valkey_free /home/gusakovy/Projects/valkey/src/zmalloc.c:415
    #3 0x4707d2 in rehashingCompleted /home/gusakovy/Projects/valkey/src/hashtable.c:456
    #4 0x471b5b in resize /home/gusakovy/Projects/valkey/src/hashtable.c:656
    #5 0x475bff in hashtableShrinkIfNeeded /home/gusakovy/Projects/valkey/src/hashtable.c:1272
    #6 0x47704b in hashtablePop /home/gusakovy/Projects/valkey/src/hashtable.c:1448
    #7 0x47716f in hashtableDelete /home/gusakovy/Projects/valkey/src/hashtable.c:1459
    valkey-io#8 0x480038 in kvstoreHashtableDelete /home/gusakovy/Projects/valkey/src/kvstore.c:847
    valkey-io#9 0x50c12c in dbGenericDeleteWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:490
    valkey-io#10 0x515f28 in deleteExpiredKeyAndPropagateWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:1831
    valkey-io#11 0x516103 in deleteExpiredKeyAndPropagate /home/gusakovy/Projects/valkey/src/db.c:1844
    valkey-io#12 0x6d8642 in activeExpireCycleTryExpire /home/gusakovy/Projects/valkey/src/expire.c:70
    valkey-io#13 0x6d8706 in expireScanCallback /home/gusakovy/Projects/valkey/src/expire.c:139
    valkey-io#14 0x478bd8 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1770
    valkey-io#15 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    valkey-io#16 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    valkey-io#17 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    valkey-io#18 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    valkey-io#19 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    valkey-io#20 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    valkey-io#21 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    valkey-io#22 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    valkey-io#23 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    valkey-io#24 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)

previously allocated by thread T0 here:
    #0 0x7f471a34a753 in __interceptor_calloc (/lib64/libasan.so.4+0xd8753)
    #1 0x4ae48c in ztrycalloc_usable_internal /home/gusakovy/Projects/valkey/src/zmalloc.c:214
    #2 0x4ae757 in valkey_calloc /home/gusakovy/Projects/valkey/src/zmalloc.c:257
    #3 0x4718fc in resize /home/gusakovy/Projects/valkey/src/hashtable.c:645
    #4 0x475bff in hashtableShrinkIfNeeded /home/gusakovy/Projects/valkey/src/hashtable.c:1272
    #5 0x47704b in hashtablePop /home/gusakovy/Projects/valkey/src/hashtable.c:1448
    #6 0x47716f in hashtableDelete /home/gusakovy/Projects/valkey/src/hashtable.c:1459
    #7 0x480038 in kvstoreHashtableDelete /home/gusakovy/Projects/valkey/src/kvstore.c:847
    valkey-io#8 0x50c12c in dbGenericDeleteWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:490
    valkey-io#9 0x515f28 in deleteExpiredKeyAndPropagateWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:1831
    valkey-io#10 0x516103 in deleteExpiredKeyAndPropagate /home/gusakovy/Projects/valkey/src/db.c:1844
    valkey-io#11 0x6d8642 in activeExpireCycleTryExpire /home/gusakovy/Projects/valkey/src/expire.c:70
    valkey-io#12 0x6d8706 in expireScanCallback /home/gusakovy/Projects/valkey/src/expire.c:139
    valkey-io#13 0x478bd8 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1770
    valkey-io#14 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    valkey-io#15 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    valkey-io#16 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    valkey-io#17 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    valkey-io#18 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    valkey-io#19 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    valkey-io#20 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    valkey-io#21 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    valkey-io#22 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    valkey-io#23 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)

SUMMARY: AddressSanitizer: heap-use-after-free /home/gusakovy/Projects/valkey/src/hashtable.c:422 in isPositionFilled
Shadow bytes around the buggy address:
  0x0c227fff85d0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff85e0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff85f0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c227fff8600: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8610: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
=>0x0c227fff8620:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8630: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8640: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8650: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8660: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8670: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==46774==ABORTING
```


**Solution**
Suggested solution is to also pause auto shrinking during
`hashtableScanDefrag`. I noticed that there was already a
`hashtablePauseAutoShrink` method and `pause_auto_shrink` counter, but
it wasn't actually used in `hashtableShrinkIfNeeded` so I fixed that.

**Testing**
I created a simple tcl test that (most of the times) triggers this
error, but it's a little clunky so I didn't add it as part of the PR:

```
start_server {tags {"expire hashtable defrag"}} {
    test {hashtable scan defrag on expiry} {

        r config set hz 100

        set num_keys 20
        for {set i 0} {$i < $num_keys} {incr i} {
            r set "key_$i" "value_$i"
        }

        for {set j 0} {$j < 50} {incr j} {
            set expire_keys 100
            for {set i 0} {$i < $expire_keys} {incr i} {
                # Short expiry time to ensure they expire quickly
                r psetex "expire_key_${i}_${j}" 100 "expire_value_${i}_${j}"
            }

            # Verify keys are set
            set initial_size [r dbsize]
            assert_equal $initial_size [expr $num_keys + $expire_keys]
            
            after 150
            for {set i 0} {$i < 10} {incr i} {
                r get "expire_key_${i}_${j}"
                after 10
            }
        }

        set remaining_keys [r dbsize]
        assert_equal $remaining_keys $num_keys

        # Verify server is still responsive
        assert_equal [r ping] {PONG}
    } {}
}
```
Compiling with ASAN using `make noopt SANITIZER=address valkey-server`
and running the test causes error above. Applying the fix resolves the
issue.

Signed-off-by: Yakov Gusakov <[email protected]>
@xbasel xbasel force-pushed the field_activeexpiry branch 2 times, most recently from 6a140c6 to b3bcf4e Compare July 2, 2025 17:48
Copy link
Owner

@ranshid ranshid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just dumping some immediate comments. will follow up soon

@xbasel xbasel force-pushed the field_activeexpiry branch 2 times, most recently from a022b84 to 1224a0b Compare July 7, 2025 00:27
xbasel added 5 commits July 7, 2025 03:27
…ritization

This change introduces an active expiration mechanism for hash fields
with TTLs, in addition to the existing key-level expiry logic. Field TTL
information is tracked in volatile sets attached to each hash. A new
activeExpireCycleFields() function incrementally scans these volatile
sets to remove expired fields, reclaiming memory proactively.

Instead of unifying key and field expiry into a single scanning loop,
we chose to keep them as separate mechanisms. The primary reason is
that their iteration models and expiry pacing are fundamentally different:
- key expiry operates on db->key space with randomized sampling;
- field expiry operates on db->key->volatile_set with per-key iteration.

Because of this structural and pacing difference, a unified algorithm
would be complex and could introduce subtle correctness and fairness
bugs. By alternating between key expiry and field expiry each cycle,
we maintain fairness while staying within the existing effort budget,
ensuring consistent latency and predictable CPU usage.

    +-----------------+
    |     DB          |
    +-----------------+
            |
            v
    +---------------------+
    |     myhash          |  (key with TTL)
    +---------------------+
            |
            v
    +------------------------------------+
    | fields (hashType)                  |
    |  - field1                          |
    |  - field2                          |
    |  - fieldN                          |
    +------------------------------------+
            |
            v
    +------------------------------------+
    | volatile set (field-level TTL)     |
    |  - field1 expires at T1            |
    |  - field5 expires at T5            |
    +------------------------------------+

Key expiry operates at the DB->key level, while field expiry operates
at the DB->key->volatile_set level, justifying their separate logic.

No new configuration was introduced; the existing active-expire-effort
and time budget are reused for both key and field expiry.

Signed-off-by: xbasel <[email protected]>
@xbasel xbasel force-pushed the field_activeexpiry branch from 1224a0b to 8e4bfcc Compare July 7, 2025 00:29
@xbasel xbasel changed the title active-expire: add field-level TTL expiry with alternating cycle prioritization active-expire: add field-level TTL active expiry Jul 7, 2025
@xbasel xbasel marked this pull request as ready for review July 7, 2025 01:22
Copy link
Owner

@ranshid ranshid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not review the entire change yet.

Lets re-iterate after this bulk

ranshid added 4 commits July 31, 2025 00:05
Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>
1. make sure expire cycle latency is acounted for the entire cycle
2. make new trace points for the 2 expire job types
3. make expiration staleness be accounted independently per each cycle type

Signed-off-by: Ran Shidlansik <[email protected]>
@ranshid
Copy link
Owner

ranshid commented Jul 31, 2025

It suddenly hit me that we are no longer calculating the expire cycle latency correctly.
Also the expired_stale percentage should be calculated differently per each job type.

@xbasel / @JimB123 care to take a quick look?

@xbasel
Copy link
Collaborator Author

xbasel commented Jul 31, 2025

It suddenly hit me that we are no longer calculating the expire cycle latency correctly. Also the expired_stale percentage should be calculated differently per each job type.

@xbasel / @JimB123 care to take a quick look?

Mixing both key and field expiry in stat_expired_stale_perc doesn't seem right. We should track them independently.
As for latency, expire-latency now reflects both key and field TTL processing since they share the same handler, but it's unclear if we should record latency once in the outer function or separately (as its done now).

@ranshid
Copy link
Owner

ranshid commented Jul 31, 2025

It suddenly hit me that we are no longer calculating the expire cycle latency correctly. Also the expired_stale percentage should be calculated differently per each job type.
@xbasel / @JimB123 care to take a quick look?

Mixing both key and field expiry in stat_expired_stale_perc doesn't seem right. We should track them independently. As for latency, expire-latency now reflects both key and field TTL processing since they share the same handler, but it's unclear if we should record latency once in the outer function or separately (as its done now).

Right now I am NOT mixing the stat_expired_stale_perc, which is not the case before the last fix. NOW we do track them independently.

expire-latency now reflects both key and field TTL processing since they share the same handler, but it's unclear if we should record latency once in the outer function or separately (as its done now).

IMO doing it differently will just change the reporting of the active expiry statistic. Since we decide to share the CPU time for both expiration jobs, it should be tracked for both.

ranshid added 8 commits August 3, 2025 15:46
…o introduce-active-expiry

Signed-off-by: Ran Shidlansik <[email protected]>
1. perform most of the actions arround hashTypeExpireEntry
2. make sure to propagate hdel even in case the hash was freed.

Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>
Copy link

@JimB123 JimB123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Active Expiry code and Active Defrag code reviewed.

ranshid added a commit that referenced this pull request Aug 5, 2025
Closes valkey-io#640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**.
This is just the first out of 3 PRs. The content of this PR focus on enabling the basic ability to set and modify hash fields expiration as well as persistency (AOF+RDB) and defrag.
[The second PR](#5) introduces the new algorithm (volatile-set) to track volatile hash fields is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just s tub implementation and will be replaced by [The second PR](#5)
[The third PR](#4) which introduces the active expiration and defragmentation jobs.

For more highlevel design details you can track the RFC PR: valkey-io/valkey-rfc#22.

---

Some highlevel major decisions which are taken as part of this work:
1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy-expiration at this point, in order to reduce complexity and rely only on active-expiration for memory reclamation. This will require us to continue to work on improving the active expiration job and potentially consider introduce lazy-expiration support later on.
3. Although different commands which are adding expiration on hash fields are influencing the memory utilization (by allocating more memory for expiration time and metadata) we decided to avoid adding the DENYOOM for these commands (an exception is HSETEX) in order to be better aligned with highlevel keys commands like `expire`
4. Some hash type commands will produce unexpected results:
 - HLEN - will still reflect the number of fields which exists in the hash object (either actually expired or not).
 - HRANDFIELD - in some cases we will not be able to randomly select a field which was not already expired. this case happen in 2 cases: 1/ when we are asked to provide a non-uniq fields (i.e negative count) 2/ when the size of the hash is much bigger than the count and we need to provide uniq results. In both cases it is possible that an empty response will be returned to the caller, even in case there are fields in the hash which are either persistent or not expired.
5. For the case were a field is provided with a zero (0) expiration time or expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high level keys are handled, we will emit hexpired keyspace event for that case (instead of hdel). For example:
for the case:
6. We will ALWAYS load hash fields during rdb load. This means that when primary is rebooting with an old snapshot, it will take time to reclaim all the expired fields. However this simplifies the current logic and avoid major refactoring that I suspect will be needed.
```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2
```

The reported events are:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field sds. Which make us use an entry just like any sds. We encode the entry layout type
in the field SDS header. Field type SDS_TYPE_5 doesn't have any spare bits to
encode this so we use it only for the first layout type.

Entry with embedded value, used for small sizes. The value is stored as
SDS_TYPE_8. The field can use any SDS type.

Entry can also have expiration timestamp, which is the UNIX timestamp for it to be expired.
For aligned fast access, we keep the expiry timestamp prior to the start of the sds header.

     +----------------+--------------+---------------+
     | Expiration     | field        | value         |
     | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
     +-----------------------^-------+---------------+
                             |
                             |
                            entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS
type 8 or higher.

     +--------------+-------+--------------+
     | Expiration   | value | field        |
     | 1234567890LL | ptr   | hdr "foo" \0 |
     +--------------+--^----+------^-------+
                       |           |
                       |           |
                       |         entry pointer (points to field sds content)
                       |
                      value pointer = value sds

The `entry.c/h` API provides methods to:
- Create, read, and write and Update field/value/expiration
- Set or clear expiration
- Check expiration state
- Clone or delete an entry

---

This PR introduces **new commands** and extends existing ones to support field expiration:

The proposed API is very much identical to the Redis provided API (Redis 7.4 + 8.0). This is intentionally proposed in order to avoid breaking client applications already opted to use hash items TTL.

**Synopsis**

```
HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds |
  EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
  FIELDS numfields field value [field value ...]
```

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The HSETEX command supports the following set of options:

* `NX` — Only set the fields if the hash object does NOT exist.
* `XX` — Only set the fields if if the hash object doesx exist.
* `FNX` — Only set the fields if none of them already exist.
* `FXX` — Only set the fields if all of them already exist.
* `EX seconds` — Set the specified expiration time in seconds.
* `PX milliseconds` — Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds` — Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL` — Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds |
  PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field
  [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds` — Set the specified expiration time, in seconds.
* `PX milliseconds` — Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds` — Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST` — Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.
Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands. This means that all the operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.
You can clear the TTL of a specific field by specifying 0 for the ‘seconds’ argument.
Note that calling `HEXPIRE`/`HPEXPIRE` with a time in the past will result in the hash field being deleted immediately.

The `HEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT]
  FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT``,` but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET` but adds one or more hash fields that expire after specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new Hash key is created.

The HSETEX command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

This PR introduces new notification events to support field-level expiration:

| Event       | Trigger                                  |
|-------------|-------------------------------------------|
| `hexpire`   | Field expiration was set                  |
| `hexpired`  | Field was deleted due to expiration       |
| `hpersist`  | Expiration was removed from a field       |
| `del`       | Key was deleted after all fields expired  |

Note that we diverge from Redis in the cases we emit hexpired event.
For example:
given the following usecase:
```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
 HTTL myhash FIELDS 1 f1
1) (integer) -2
```
regarding the keyspace-notifications:
Redis reports:
```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However In our current suggestion, Valkey will emit:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like:
  - `HDEL` (for expired fields)
  - `HPEXPIREAT` (for setting absolute expiration)
  - `HPERSIST` (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.

---

| Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % |
|--------------|-------------|---------|------------|----------------------|------------------|----------------|
| **One Large Hash Table** |
| HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% |
| HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% |
| HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% |
| HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% |
| **Many Hash Tables (100 fields)** |
| HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% |
| HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% |
| HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% |
| HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% |
| HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% |
| **Many Hash Tables (1000 fields)** |
| HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% |
| HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% |
| HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** |
| HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% |
| HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% |

[ ] Consider extending HSETEX with extra arguments: NX/XX so that it is possible to prevent adding/setting/mutating fields of a non-existent hash
[ ] Avoid loading expired fields when non-preamble RDB is being loaded on primary. This is an optimization in order to reduce loading unnecessary fields (which are expired). This would also require us to propagate the HDEL to the replicas in case of RDBFLAGS_FEED_REPL. Note that it might have to require some refactoring:
1/ propagate the rdbflags and current time to rdbLoadObject. 2/ consider the case of restore and check_rdb etc...
For this reason I would like to avoid this optimizationfor the first drop.
ranshid added a commit that referenced this pull request Aug 5, 2025
Closes valkey-io#640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**.
This is just the first out of 3 PRs. The content of this PR focus on enabling the basic ability to set and modify hash fields expiration as well as persistency (AOF+RDB) and defrag.
[The second PR](#5) introduces the new algorithm (volatile-set) to track volatile hash fields is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just s tub implementation and will be replaced by [The second PR](#5)
[The third PR](#4) which introduces the active expiration and defragmentation jobs.

For more highlevel design details you can track the RFC PR: valkey-io/valkey-rfc#22.

---

Some highlevel major decisions which are taken as part of this work:
1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy-expiration at this point, in order to reduce complexity and rely only on active-expiration for memory reclamation. This will require us to continue to work on improving the active expiration job and potentially consider introduce lazy-expiration support later on.
3. Although different commands which are adding expiration on hash fields are influencing the memory utilization (by allocating more memory for expiration time and metadata) we decided to avoid adding the DENYOOM for these commands (an exception is HSETEX) in order to be better aligned with highlevel keys commands like `expire`
4. Some hash type commands will produce unexpected results:
 - HLEN - will still reflect the number of fields which exists in the hash object (either actually expired or not).
 - HRANDFIELD - in some cases we will not be able to randomly select a field which was not already expired. this case happen in 2 cases: 1/ when we are asked to provide a non-uniq fields (i.e negative count) 2/ when the size of the hash is much bigger than the count and we need to provide uniq results. In both cases it is possible that an empty response will be returned to the caller, even in case there are fields in the hash which are either persistent or not expired.
5. For the case were a field is provided with a zero (0) expiration time or expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high level keys are handled, we will emit hexpired keyspace event for that case (instead of hdel). For example:
for the case:
6. We will ALWAYS load hash fields during rdb load. This means that when primary is rebooting with an old snapshot, it will take time to reclaim all the expired fields. However this simplifies the current logic and avoid major refactoring that I suspect will be needed.
```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2
```

The reported events are:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field sds. Which make us use an entry just like any sds. We encode the entry layout type
in the field SDS header. Field type SDS_TYPE_5 doesn't have any spare bits to
encode this so we use it only for the first layout type.

Entry with embedded value, used for small sizes. The value is stored as
SDS_TYPE_8. The field can use any SDS type.

Entry can also have expiration timestamp, which is the UNIX timestamp for it to be expired.
For aligned fast access, we keep the expiry timestamp prior to the start of the sds header.

     +----------------+--------------+---------------+
     | Expiration     | field        | value         |
     | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
     +-----------------------^-------+---------------+
                             |
                             |
                            entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS
type 8 or higher.

     +--------------+-------+--------------+
     | Expiration   | value | field        |
     | 1234567890LL | ptr   | hdr "foo" \0 |
     +--------------+--^----+------^-------+
                       |           |
                       |           |
                       |         entry pointer (points to field sds content)
                       |
                      value pointer = value sds

The `entry.c/h` API provides methods to:
- Create, read, and write and Update field/value/expiration
- Set or clear expiration
- Check expiration state
- Clone or delete an entry

---

This PR introduces **new commands** and extends existing ones to support field expiration:

The proposed API is very much identical to the Redis provided API (Redis 7.4 + 8.0). This is intentionally proposed in order to avoid breaking client applications already opted to use hash items TTL.

**Synopsis**

```
HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds |
  EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
  FIELDS numfields field value [field value ...]
```

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The HSETEX command supports the following set of options:

* `NX` — Only set the fields if the hash object does NOT exist.
* `XX` — Only set the fields if if the hash object doesx exist.
* `FNX` — Only set the fields if none of them already exist.
* `FXX` — Only set the fields if all of them already exist.
* `EX seconds` — Set the specified expiration time in seconds.
* `PX milliseconds` — Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds` — Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL` — Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds |
  PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field
  [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds` — Set the specified expiration time, in seconds.
* `PX milliseconds` — Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds` — Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST` — Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.
Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands. This means that all the operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.
You can clear the TTL of a specific field by specifying 0 for the ‘seconds’ argument.
Note that calling `HEXPIRE`/`HPEXPIRE` with a time in the past will result in the hash field being deleted immediately.

The `HEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT]
  FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT``,` but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET` but adds one or more hash fields that expire after specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new Hash key is created.

The HSETEX command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

This PR introduces new notification events to support field-level expiration:

| Event       | Trigger                                  |
|-------------|-------------------------------------------|
| `hexpire`   | Field expiration was set                  |
| `hexpired`  | Field was deleted due to expiration       |
| `hpersist`  | Expiration was removed from a field       |
| `del`       | Key was deleted after all fields expired  |

Note that we diverge from Redis in the cases we emit hexpired event.
For example:
given the following usecase:
```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
 HTTL myhash FIELDS 1 f1
1) (integer) -2
```
regarding the keyspace-notifications:
Redis reports:
```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However In our current suggestion, Valkey will emit:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like:
  - `HDEL` (for expired fields)
  - `HPEXPIREAT` (for setting absolute expiration)
  - `HPERSIST` (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.

---

| Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % |
|--------------|-------------|---------|------------|----------------------|------------------|----------------|
| **One Large Hash Table** |
| HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% |
| HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% |
| HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% |
| HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% |
| **Many Hash Tables (100 fields)** |
| HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% |
| HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% |
| HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% |
| HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% |
| HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% |
| **Many Hash Tables (1000 fields)** |
| HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% |
| HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% |
| HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** |
| HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% |
| HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% |

[ ] Consider extending HSETEX with extra arguments: NX/XX so that it is possible to prevent adding/setting/mutating fields of a non-existent hash
[ ] Avoid loading expired fields when non-preamble RDB is being loaded on primary. This is an optimization in order to reduce loading unnecessary fields (which are expired). This would also require us to propagate the HDEL to the replicas in case of RDBFLAGS_FEED_REPL. Note that it might have to require some refactoring:
1/ propagate the rdbflags and current time to rdbLoadObject. 2/ consider the case of restore and check_rdb etc...
For this reason I would like to avoid this optimizationfor the first drop.

Signed-off-by: Ran Shidlansik <[email protected]>
@ranshid ranshid closed this Aug 5, 2025
@ranshid
Copy link
Owner

ranshid commented Aug 5, 2025

This was manually cherry-picked into valkey-io#2089

ranshid added a commit that referenced this pull request Aug 5, 2025
Closes valkey-io#640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**.
This is just the first out of 3 PRs. The content of this PR focus on enabling the basic ability to set and modify hash fields expiration as well as persistency (AOF+RDB) and defrag.
[The second PR](#5) introduces the new algorithm (volatile-set) to track volatile hash fields is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just s tub implementation and will be replaced by [The second PR](#5)
[The third PR](#4) which introduces the active expiration and defragmentation jobs.

For more highlevel design details you can track the RFC PR: valkey-io/valkey-rfc#22.

---

Some highlevel major decisions which are taken as part of this work:
1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy-expiration at this point, in order to reduce complexity and rely only on active-expiration for memory reclamation. This will require us to continue to work on improving the active expiration job and potentially consider introduce lazy-expiration support later on.
3. Although different commands which are adding expiration on hash fields are influencing the memory utilization (by allocating more memory for expiration time and metadata) we decided to avoid adding the DENYOOM for these commands (an exception is HSETEX) in order to be better aligned with highlevel keys commands like `expire`
4. Some hash type commands will produce unexpected results:
 - HLEN - will still reflect the number of fields which exists in the hash object (either actually expired or not).
 - HRANDFIELD - in some cases we will not be able to randomly select a field which was not already expired. this case happen in 2 cases: 1/ when we are asked to provide a non-uniq fields (i.e negative count) 2/ when the size of the hash is much bigger than the count and we need to provide uniq results. In both cases it is possible that an empty response will be returned to the caller, even in case there are fields in the hash which are either persistent or not expired.
5. For the case were a field is provided with a zero (0) expiration time or expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high level keys are handled, we will emit hexpired keyspace event for that case (instead of hdel). For example:
for the case:
6. We will ALWAYS load hash fields during rdb load. This means that when primary is rebooting with an old snapshot, it will take time to reclaim all the expired fields. However this simplifies the current logic and avoid major refactoring that I suspect will be needed.
```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2
```

The reported events are:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field sds. Which make us use an entry just like any sds. We encode the entry layout type
in the field SDS header. Field type SDS_TYPE_5 doesn't have any spare bits to
encode this so we use it only for the first layout type.

Entry with embedded value, used for small sizes. The value is stored as
SDS_TYPE_8. The field can use any SDS type.

Entry can also have expiration timestamp, which is the UNIX timestamp for it to be expired.
For aligned fast access, we keep the expiry timestamp prior to the start of the sds header.

     +----------------+--------------+---------------+
     | Expiration     | field        | value         |
     | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
     +-----------------------^-------+---------------+
                             |
                             |
                            entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS
type 8 or higher.

     +--------------+-------+--------------+
     | Expiration   | value | field        |
     | 1234567890LL | ptr   | hdr "foo" \0 |
     +--------------+--^----+------^-------+
                       |           |
                       |           |
                       |         entry pointer (points to field sds content)
                       |
                      value pointer = value sds

The `entry.c/h` API provides methods to:
- Create, read, and write and Update field/value/expiration
- Set or clear expiration
- Check expiration state
- Clone or delete an entry

---

This PR introduces **new commands** and extends existing ones to support field expiration:

The proposed API is very much identical to the Redis provided API (Redis 7.4 + 8.0). This is intentionally proposed in order to avoid breaking client applications already opted to use hash items TTL.

**Synopsis**

```
HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds |
  EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
  FIELDS numfields field value [field value ...]
```

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The HSETEX command supports the following set of options:

* `NX` — Only set the fields if the hash object does NOT exist.
* `XX` — Only set the fields if if the hash object doesx exist.
* `FNX` — Only set the fields if none of them already exist.
* `FXX` — Only set the fields if all of them already exist.
* `EX seconds` — Set the specified expiration time in seconds.
* `PX milliseconds` — Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds` — Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL` — Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds |
  PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field
  [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds` — Set the specified expiration time, in seconds.
* `PX milliseconds` — Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds` — Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST` — Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.
Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands. This means that all the operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.
You can clear the TTL of a specific field by specifying 0 for the ‘seconds’ argument.
Note that calling `HEXPIRE`/`HPEXPIRE` with a time in the past will result in the hash field being deleted immediately.

The `HEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT]
  FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT``,` but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET` but adds one or more hash fields that expire after specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new Hash key is created.

The HSETEX command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

This PR introduces new notification events to support field-level expiration:

| Event       | Trigger                                  |
|-------------|-------------------------------------------|
| `hexpire`   | Field expiration was set                  |
| `hexpired`  | Field was deleted due to expiration       |
| `hpersist`  | Expiration was removed from a field       |
| `del`       | Key was deleted after all fields expired  |

Note that we diverge from Redis in the cases we emit hexpired event.
For example:
given the following usecase:
```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
 HTTL myhash FIELDS 1 f1
1) (integer) -2
```
regarding the keyspace-notifications:
Redis reports:
```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However In our current suggestion, Valkey will emit:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like:
  - `HDEL` (for expired fields)
  - `HPEXPIREAT` (for setting absolute expiration)
  - `HPERSIST` (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.

---

| Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % |
|--------------|-------------|---------|------------|----------------------|------------------|----------------|
| **One Large Hash Table** |
| HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% |
| HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% |
| HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% |
| HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% |
| **Many Hash Tables (100 fields)** |
| HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% |
| HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% |
| HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% |
| HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% |
| HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% |
| **Many Hash Tables (1000 fields)** |
| HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% |
| HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% |
| HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** |
| HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% |
| HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% |

[ ] Consider extending HSETEX with extra arguments: NX/XX so that it is possible to prevent adding/setting/mutating fields of a non-existent hash
[ ] Avoid loading expired fields when non-preamble RDB is being loaded on primary. This is an optimization in order to reduce loading unnecessary fields (which are expired). This would also require us to propagate the HDEL to the replicas in case of RDBFLAGS_FEED_REPL. Note that it might have to require some refactoring:
1/ propagate the rdbflags and current time to rdbLoadObject. 2/ consider the case of restore and check_rdb etc...
For this reason I would like to avoid this optimizationfor the first drop.

Signed-off-by: Ran Shidlansik <[email protected]>
ranshid added a commit to valkey-io/valkey that referenced this pull request Aug 5, 2025
Closes #640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**.
This is just the first out of 3 PRs. The content of this PR focus on enabling the basic ability to set and modify hash fields expiration as well as persistency (AOF+RDB) and defrag.
[The second PR](ranshid#5) introduces the new algorithm (volatile-set) to track volatile hash fields is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just s tub implementation and will be replaced by [The second PR](ranshid#5)
[The third PR](ranshid#4) which introduces the active expiration and defragmentation jobs.

For more highlevel design details you can track the RFC PR: valkey-io/valkey-rfc#22.

---

Some highlevel major decisions which are taken as part of this work:
1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy-expiration at this point, in order to reduce complexity and rely only on active-expiration for memory reclamation. This will require us to continue to work on improving the active expiration job and potentially consider introduce lazy-expiration support later on.
3. Although different commands which are adding expiration on hash fields are influencing the memory utilization (by allocating more memory for expiration time and metadata) we decided to avoid adding the DENYOOM for these commands (an exception is HSETEX) in order to be better aligned with highlevel keys commands like `expire`
4. Some hash type commands will produce unexpected results:
 - HLEN - will still reflect the number of fields which exists in the hash object (either actually expired or not).
 - HRANDFIELD - in some cases we will not be able to randomly select a field which was not already expired. this case happen in 2 cases: 1/ when we are asked to provide a non-uniq fields (i.e negative count) 2/ when the size of the hash is much bigger than the count and we need to provide uniq results. In both cases it is possible that an empty response will be returned to the caller, even in case there are fields in the hash which are either persistent or not expired.
5. For the case were a field is provided with a zero (0) expiration time or expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high level keys are handled, we will emit hexpired keyspace event for that case (instead of hdel). For example:
for the case:
6. We will ALWAYS load hash fields during rdb load. This means that when primary is rebooting with an old snapshot, it will take time to reclaim all the expired fields. However this simplifies the current logic and avoid major refactoring that I suspect will be needed.
```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2
```

The reported events are:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field sds. Which make us use an entry just like any sds. We encode the entry layout type
in the field SDS header. Field type SDS_TYPE_5 doesn't have any spare bits to
encode this so we use it only for the first layout type.

Entry with embedded value, used for small sizes. The value is stored as
SDS_TYPE_8. The field can use any SDS type.

Entry can also have expiration timestamp, which is the UNIX timestamp for it to be expired.
For aligned fast access, we keep the expiry timestamp prior to the start of the sds header.

     +----------------+--------------+---------------+
     | Expiration     | field        | value         |
     | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
     +-----------------------^-------+---------------+
                             |
                             |
                            entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS
type 8 or higher.

     +--------------+-------+--------------+
     | Expiration   | value | field        |
     | 1234567890LL | ptr   | hdr "foo" \0 |
     +--------------+--^----+------^-------+
                       |           |
                       |           |
                       |         entry pointer (points to field sds content)
                       |
                      value pointer = value sds

The `entry.c/h` API provides methods to:
- Create, read, and write and Update field/value/expiration
- Set or clear expiration
- Check expiration state
- Clone or delete an entry

---

This PR introduces **new commands** and extends existing ones to support field expiration:

The proposed API is very much identical to the Redis provided API (Redis 7.4 + 8.0). This is intentionally proposed in order to avoid breaking client applications already opted to use hash items TTL.

**Synopsis**

```
HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds |
  EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
  FIELDS numfields field value [field value ...]
```

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The HSETEX command supports the following set of options:

* `NX` — Only set the fields if the hash object does NOT exist.
* `XX` — Only set the fields if if the hash object doesx exist.
* `FNX` — Only set the fields if none of them already exist.
* `FXX` — Only set the fields if all of them already exist.
* `EX seconds` — Set the specified expiration time in seconds.
* `PX milliseconds` — Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds` — Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL` — Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds |
  PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field
  [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds` — Set the specified expiration time, in seconds.
* `PX milliseconds` — Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds` — Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST` — Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.
Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands. This means that all the operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.
You can clear the TTL of a specific field by specifying 0 for the ‘seconds’ argument.
Note that calling `HEXPIRE`/`HPEXPIRE` with a time in the past will result in the hash field being deleted immediately.

The `HEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT]
  FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT``,` but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET` but adds one or more hash fields that expire after specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new Hash key is created.

The HSETEX command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

This PR introduces new notification events to support field-level expiration:

| Event       | Trigger                                  |
|-------------|-------------------------------------------|
| `hexpire`   | Field expiration was set                  |
| `hexpired`  | Field was deleted due to expiration       |
| `hpersist`  | Expiration was removed from a field       |
| `del`       | Key was deleted after all fields expired  |

Note that we diverge from Redis in the cases we emit hexpired event.
For example:
given the following usecase:
```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
 HTTL myhash FIELDS 1 f1
1) (integer) -2
```
regarding the keyspace-notifications:
Redis reports:
```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However In our current suggestion, Valkey will emit:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like:
  - `HDEL` (for expired fields)
  - `HPEXPIREAT` (for setting absolute expiration)
  - `HPERSIST` (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.

---

| Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % |
|--------------|-------------|---------|------------|----------------------|------------------|----------------|
| **One Large Hash Table** |
| HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% |
| HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% |
| HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% |
| HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% |
| **Many Hash Tables (100 fields)** |
| HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% |
| HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% |
| HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% |
| HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% |
| HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% |
| **Many Hash Tables (1000 fields)** |
| HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% |
| HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% |
| HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** |
| HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% |
| HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% |

[ ] Consider extending HSETEX with extra arguments: NX/XX so that it is possible to prevent adding/setting/mutating fields of a non-existent hash
[ ] Avoid loading expired fields when non-preamble RDB is being loaded on primary. This is an optimization in order to reduce loading unnecessary fields (which are expired). This would also require us to propagate the HDEL to the replicas in case of RDBFLAGS_FEED_REPL. Note that it might have to require some refactoring:
1/ propagate the rdbflags and current time to rdbLoadObject. 2/ consider the case of restore and check_rdb etc...
For this reason I would like to avoid this optimizationfor the first drop.

Signed-off-by: Ran Shidlansik <[email protected]>
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
Closes valkey-io#640

This PR introduces support for **field-level expiration in Valkey hash types**, making it possible for individual fields inside a hash to expire independently — creating what we call **volatile fields**.
This is just the first out of 3 PRs. The content of this PR focus on enabling the basic ability to set and modify hash fields expiration as well as persistency (AOF+RDB) and defrag.
[The second PR](ranshid#5) introduces the new algorithm (volatile-set) to track volatile hash fields is in the last stages of review. The current implementation in this PR (in volatile-set.h/c) is just s tub implementation and will be replaced by [The second PR](ranshid#5)
[The third PR](ranshid#4) which introduces the active expiration and defragmentation jobs.

For more highlevel design details you can track the RFC PR: valkey-io/valkey-rfc#22.

---

Some highlevel major decisions which are taken as part of this work:
1. We decided to copy the existing Redis API in order to maintain compatibility with existing clients.
2. We decided to avoid introducing lazy-expiration at this point, in order to reduce complexity and rely only on active-expiration for memory reclamation. This will require us to continue to work on improving the active expiration job and potentially consider introduce lazy-expiration support later on.
3. Although different commands which are adding expiration on hash fields are influencing the memory utilization (by allocating more memory for expiration time and metadata) we decided to avoid adding the DENYOOM for these commands (an exception is HSETEX) in order to be better aligned with highlevel keys commands like `expire`
4. Some hash type commands will produce unexpected results:
 - HLEN - will still reflect the number of fields which exists in the hash object (either actually expired or not).
 - HRANDFIELD - in some cases we will not be able to randomly select a field which was not already expired. this case happen in 2 cases: 1/ when we are asked to provide a non-uniq fields (i.e negative count) 2/ when the size of the hash is much bigger than the count and we need to provide uniq results. In both cases it is possible that an empty response will be returned to the caller, even in case there are fields in the hash which are either persistent or not expired.
5. For the case were a field is provided with a zero (0) expiration time or expiration time in the past, it is immediately deleted. We decided that, in order to be aligned with how high level keys are handled, we will emit hexpired keyspace event for that case (instead of hdel). For example:
for the case:
6. We will ALWAYS load hash fields during rdb load. This means that when primary is rebooting with an old snapshot, it will take time to reclaim all the expired fields. However this simplifies the current logic and avoid major refactoring that I suspect will be needed.
```
HSET myhash f1 v1
> 0
HGETEX myhash EX 0 FIELDS 1 f1
> "v1"
HTTL myhash FIELDS 1 f1
>  -2
```

The reported events are:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

This PR also **modularizes and exposes the internal `hashTypeEntry` logic** as a new standalone `entry.c/h` module. This new abstraction handles all aspects of **field–value–expiry encoding** using multiple memory layouts optimized for performance and memory efficiency.

An `entry` is an abstraction that represents a single **field–value pair with optional expiration**. Internally, Valkey uses different memory layouts for compactness and efficiency, chosen dynamically based on size and encoding constraints.

The entry pointer is the field sds. Which make us use an entry just like any sds. We encode the entry layout type
in the field SDS header. Field type SDS_TYPE_5 doesn't have any spare bits to
encode this so we use it only for the first layout type.

Entry with embedded value, used for small sizes. The value is stored as
SDS_TYPE_8. The field can use any SDS type.

Entry can also have expiration timestamp, which is the UNIX timestamp for it to be expired.
For aligned fast access, we keep the expiry timestamp prior to the start of the sds header.

     +----------------+--------------+---------------+
     | Expiration     | field        | value         |
     | 1234567890LL   | hdr "foo" \0 | hdr8 "bar" \0 |
     +-----------------------^-------+---------------+
                             |
                             |
                            entry pointer (points to field sds content)

Entry with value pointer, used for larger fields and values. The field is SDS
type 8 or higher.

     +--------------+-------+--------------+
     | Expiration   | value | field        |
     | 1234567890LL | ptr   | hdr "foo" \0 |
     +--------------+--^----+------^-------+
                       |           |
                       |           |
                       |         entry pointer (points to field sds content)
                       |
                      value pointer = value sds

The `entry.c/h` API provides methods to:
- Create, read, and write and Update field/value/expiration
- Set or clear expiration
- Check expiration state
- Clone or delete an entry

---

This PR introduces **new commands** and extends existing ones to support field expiration:

The proposed API is very much identical to the Redis provided API (Redis 7.4 + 8.0). This is intentionally proposed in order to avoid breaking client applications already opted to use hash items TTL.

**Synopsis**

```
HSETEX key [NX | XX] [FNX | FXX] [EX seconds | PX milliseconds |
  EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
  FIELDS numfields field value [field value ...]
```

Set the value of one or more fields of a given hash key, and optionally set their expiration time or time-to-live (TTL).

The HSETEX command supports the following set of options:

* `NX` — Only set the fields if the hash object does NOT exist.
* `XX` — Only set the fields if if the hash object doesx exist.
* `FNX` — Only set the fields if none of them already exist.
* `FXX` — Only set the fields if all of them already exist.
* `EX seconds` — Set the specified expiration time in seconds.
* `PX milliseconds` — Set the specified expiration time in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time in seconds at which the fields will expire.
* `PXAT unix-time-milliseconds` — Set the specified Unix time in milliseconds at which the fields will expire.
* `KEEPTTL` — Retain the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `KEEPTTL` options are mutually exclusive.

**Synopsis**

```
HGETEX key [EX seconds | PX milliseconds | EXAT unix-time-seconds |
  PXAT unix-time-milliseconds | PERSIST] FIELDS numfields field
  [field ...]
```

Get the value of one or more fields of a given hash key and optionally set their expiration time or time-to-live (TTL).

The `HGETEX` command supports a set of options:

* `EX seconds` — Set the specified expiration time, in seconds.
* `PX milliseconds` — Set the specified expiration time, in milliseconds.
* `EXAT unix-time-seconds` — Set the specified Unix time at which the fields will expire, in seconds.
* `PXAT unix-time-milliseconds` — Set the specified Unix time at which the fields will expire, in milliseconds.
* `PERSIST` — Remove the TTL associated with the fields.

The `EX`, `PX`, `EXAT`, `PXAT`, and `PERSIST` options are mutually exclusive.

**Synopsis**

```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

Set an expiration (TTL or time to live) on one or more fields of a given hash key. You must specify at least one field. Field(s) will automatically be deleted from the hash key when their TTLs expire.
Field expirations will only be cleared by commands that delete or overwrite the contents of the hash fields, including `HDEL` and `HSET` commands. This means that all the operations that conceptually *alter* the value stored at a hash key's field without replacing it with a new one will leave the TTL untouched.
You can clear the TTL of a specific field by specifying 0 for the ‘seconds’ argument.
Note that calling `HEXPIRE`/`HPEXPIRE` with a time in the past will result in the hash field being deleted immediately.

The `HEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HEXPIREAT key unix-time-seconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

`HEXPIREAT` has the same effect and semantics as `HEXPIRE`, but instead of specifying the number of seconds for the TTL (time to live), it takes an absolute Unix timestamp in seconds since Unix epoch. A timestamp in the past will delete the field immediately.

The `HEXPIREAT` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIRE key milliseconds [NX | XX | GT | LT] FIELDS numfields
  field [field ...]
```

This command works like `HEXPIRE`, but the expiration of a field is specified in milliseconds instead of seconds.

The `HPEXPIRE` command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.
* `XX` — For each specified field, set expiration only when the field has an existing expiration.
* `GT` — For each specified field, set expiration only when the new expiration is greater than current one.
* `LT` — For each specified field, set expiration only when the new expiration is less than current one.

**Synopsis**

```
HPEXPIREAT key unix-time-milliseconds [NX | XX | GT | LT]
  FIELDS numfields field [field ...]
```

`HPEXPIREAT` has the same effect and semantics as `HEXPIREAT``,` but the Unix time at which the field will expire is specified in milliseconds since Unix epoch instead of seconds.

**Synopsis**

```
HPERSIST key FIELDS numfields field [field ...]
```

Remove the existing expiration on a hash key's field(s), turning the field(s) from *volatile* (a field with expiration set) to *persistent* (a field that will never expire as no TTL (time to live) is associated).

**Synopsis**

```
HSETEX key [NX] seconds field value [field value ...]
```

Similar to `HSET` but adds one or more hash fields that expire after specified number of seconds. By default, this command overwrites the values and expirations of specified fields that exist in the hash. If `NX` option is specified, the field data will not be overwritten. If `key` doesn't exist, a new Hash key is created.

The HSETEX command supports a set of options:

* `NX` — For each specified field, set expiration only when the field has no expiration.

**Synopsis**

```
HTTL key FIELDS numfields field [field ...]
```

Returns the **remaining** TTL (time to live) of a hash key's field(s) that have a set expiration. This introspection capability allows you to check how many seconds a given hash field will continue to be part of the hash key.

```
HPTTL key FIELDS numfields field [field ...]
```

Like `HTTL`, this command returns the remaining TTL (time to live) of a field that has an expiration set, but in milliseconds instead of seconds.

**Synopsis**

```
HEXPIRETIME key FIELDS numfields field [field ...]
```

Returns the absolute Unix timestamp in seconds since Unix epoch at which the given key's field(s) will expire.

**Synopsis**

```
HPEXPIRETIME key FIELDS numfields field [field ...]
```

`HPEXPIRETIME` has the same semantics as `HEXPIRETIME`, but returns the absolute Unix expiration timestamp in milliseconds since Unix epoch instead of seconds.

This PR introduces new notification events to support field-level expiration:

| Event       | Trigger                                  |
|-------------|-------------------------------------------|
| `hexpire`   | Field expiration was set                  |
| `hexpired`  | Field was deleted due to expiration       |
| `hpersist`  | Expiration was removed from a field       |
| `del`       | Key was deleted after all fields expired  |

Note that we diverge from Redis in the cases we emit hexpired event.
For example:
given the following usecase:
```
HSET myhash f1 v1
(integer) 0
HGETEX myhash EX 0 FIELDS 1 f1
1) "v1"
 HTTL myhash FIELDS 1 f1
1) (integer) -2
```
regarding the keyspace-notifications:
Redis reports:
```
1) "psubscribe"
2) "__keyevent@0__:*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hset"
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:hdel" <---------------- note this
4) "myhash2"
1) "pmessage"
2) "__keyevent@0__:*"
3) "__keyevent@0__:del"
4) "myhash2"
```

However In our current suggestion, Valkey will emit:
```
1) "psubscribe"
2) "__keyevent@0__*"
3) (integer) 1
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hset"
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:hexpired" <---------------- note this
4) "myhash"
1) "pmessage"
2) "__keyevent@0__*"
3) "__keyevent@0__:del"
4) "myhash"
```
---

- Expiration-aware commands (`HSETEX`, `HGETEX`, etc.) are **not propagated as-is**.
- Instead, Valkey rewrites them into equivalent commands like:
  - `HDEL` (for expired fields)
  - `HPEXPIREAT` (for setting absolute expiration)
  - `HPERSIST` (for removing expiration)

This ensures compatibility with replication and AOF while maintaining consistent field-level expiry behavior.

---

| Command Name | QPS Standard | QPS HFE | QPS Diff % | Latency Standard (ms) | Latency HFE (ms) | Latency Diff % |
|--------------|-------------|---------|------------|----------------------|------------------|----------------|
| **One Large Hash Table** |
| HGET | 137988.12 | 138484.97 | +0.36% | 0.951 | 0.949 | -0.21% |
| HSET | 138561.73 | 137343.77 | -0.87% | 0.948 | 0.956 | +0.84% |
| HEXISTS | 139431.12 | 138677.02 | -0.54% | 0.942 | 0.946 | +0.42% |
| HDEL | 140114.89 | 138966.09 | -0.81% | 0.938 | 0.945 | +0.74% |
| **Many Hash Tables (100 fields)** |
| HGET | 136798.91 | 137419.27 | +0.45% | 0.959 | 0.956 | -0.31% |
| HEXISTS | 138946.78 | 139645.31 | +0.50% | 0.946 | 0.941 | -0.52% |
| HGETALL | 42194.09 | 42016.80 | -0.42% | 0.621 | 0.625 | +0.64% |
| HSET | 137230.69 | 137249.53 | +0.01% | 0.959 | 0.958 | -0.10% |
| HDEL | 138985.41 | 138619.34 | -0.26% | 0.948 | 0.949 | +0.10% |
| **Many Hash Tables (1000 fields)** |
| HGET | 135795.77 | 139256.36 | +2.54% | 0.965 | 0.943 | -2.27% |
| HEXISTS | 138121.55 | 137950.06 | -0.12% | 0.951 | 0.952 | +0.10% |
| HGETALL | 5885.81 | 5633.80 | **-4.28%** | 2.690 | 2.841 | **+5.61%** |
| HSET | 137005.08 | 137400.39 | +0.28% | 0.959 | 0.955 | -0.41% |
| HDEL | 138293.45 | 137381.52 | -0.65% | 0.948 | 0.955 | +0.73% |

[ ] Consider extending HSETEX with extra arguments: NX/XX so that it is possible to prevent adding/setting/mutating fields of a non-existent hash
[ ] Avoid loading expired fields when non-preamble RDB is being loaded on primary. This is an optimization in order to reduce loading unnecessary fields (which are expired). This would also require us to propagate the HDEL to the replicas in case of RDBFLAGS_FEED_REPL. Note that it might have to require some refactoring:
1/ propagate the rdbflags and current time to rdbLoadObject. 2/ consider the case of restore and check_rdb etc...
For this reason I would like to avoid this optimizationfor the first drop.

Signed-off-by: Ran Shidlansik <[email protected]>
ranshid pushed a commit that referenced this pull request Sep 30, 2025
…y-io#2257)

**Current state**
During `hashtableScanDefrag`, rehashing is paused to prevent entries
from moving, but the scan callback can still delete entries which
triggers `hashtableShrinkIfNeeded`. For example, the
`expireScanCallback` can delete expired entries.

**Issue**
This can cause the table to be resized and the old memory to be freed
while the scan is still accessing it, resulting in the following memory
access violation:

```
[err]: Sanitizer error: =================================================================
==46774==ERROR: AddressSanitizer: heap-use-after-free on address 0x611000003100 at pc 0x0000004704d3 bp 0x7fffcb062000 sp 0x7fffcb061ff0
READ of size 1 at 0x611000003100 thread T0
    #0 0x4704d2 in isPositionFilled /home/gusakovy/Projects/valkey/src/hashtable.c:422
    #1 0x478b45 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1768
    #2 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    #3 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    #4 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    #5 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    #6 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    #7 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    valkey-io#8 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    valkey-io#9 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    valkey-io#10 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    valkey-io#11 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)
    valkey-io#12 0x452e39 in _start (/local/home/gusakovy/Projects/valkey/src/valkey-server+0x452e39)

0x611000003100 is located 0 bytes inside of 256-byte region [0x611000003100,0x611000003200)
freed by thread T0 here:
    #0 0x7f471a34a1e5 in __interceptor_free (/lib64/libasan.so.4+0xd81e5)
    #1 0x4aefbc in zfree_internal /home/gusakovy/Projects/valkey/src/zmalloc.c:400
    #2 0x4aeff5 in valkey_free /home/gusakovy/Projects/valkey/src/zmalloc.c:415
    #3 0x4707d2 in rehashingCompleted /home/gusakovy/Projects/valkey/src/hashtable.c:456
    #4 0x471b5b in resize /home/gusakovy/Projects/valkey/src/hashtable.c:656
    #5 0x475bff in hashtableShrinkIfNeeded /home/gusakovy/Projects/valkey/src/hashtable.c:1272
    #6 0x47704b in hashtablePop /home/gusakovy/Projects/valkey/src/hashtable.c:1448
    #7 0x47716f in hashtableDelete /home/gusakovy/Projects/valkey/src/hashtable.c:1459
    valkey-io#8 0x480038 in kvstoreHashtableDelete /home/gusakovy/Projects/valkey/src/kvstore.c:847
    valkey-io#9 0x50c12c in dbGenericDeleteWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:490
    valkey-io#10 0x515f28 in deleteExpiredKeyAndPropagateWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:1831
    valkey-io#11 0x516103 in deleteExpiredKeyAndPropagate /home/gusakovy/Projects/valkey/src/db.c:1844
    valkey-io#12 0x6d8642 in activeExpireCycleTryExpire /home/gusakovy/Projects/valkey/src/expire.c:70
    valkey-io#13 0x6d8706 in expireScanCallback /home/gusakovy/Projects/valkey/src/expire.c:139
    valkey-io#14 0x478bd8 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1770
    valkey-io#15 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    valkey-io#16 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    valkey-io#17 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    valkey-io#18 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    valkey-io#19 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    valkey-io#20 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    valkey-io#21 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    valkey-io#22 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    valkey-io#23 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    valkey-io#24 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)

previously allocated by thread T0 here:
    #0 0x7f471a34a753 in __interceptor_calloc (/lib64/libasan.so.4+0xd8753)
    #1 0x4ae48c in ztrycalloc_usable_internal /home/gusakovy/Projects/valkey/src/zmalloc.c:214
    #2 0x4ae757 in valkey_calloc /home/gusakovy/Projects/valkey/src/zmalloc.c:257
    #3 0x4718fc in resize /home/gusakovy/Projects/valkey/src/hashtable.c:645
    #4 0x475bff in hashtableShrinkIfNeeded /home/gusakovy/Projects/valkey/src/hashtable.c:1272
    #5 0x47704b in hashtablePop /home/gusakovy/Projects/valkey/src/hashtable.c:1448
    #6 0x47716f in hashtableDelete /home/gusakovy/Projects/valkey/src/hashtable.c:1459
    #7 0x480038 in kvstoreHashtableDelete /home/gusakovy/Projects/valkey/src/kvstore.c:847
    valkey-io#8 0x50c12c in dbGenericDeleteWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:490
    valkey-io#9 0x515f28 in deleteExpiredKeyAndPropagateWithDictIndex /home/gusakovy/Projects/valkey/src/db.c:1831
    valkey-io#10 0x516103 in deleteExpiredKeyAndPropagate /home/gusakovy/Projects/valkey/src/db.c:1844
    valkey-io#11 0x6d8642 in activeExpireCycleTryExpire /home/gusakovy/Projects/valkey/src/expire.c:70
    valkey-io#12 0x6d8706 in expireScanCallback /home/gusakovy/Projects/valkey/src/expire.c:139
    valkey-io#13 0x478bd8 in hashtableScanDefrag /home/gusakovy/Projects/valkey/src/hashtable.c:1770
    valkey-io#14 0x4789c2 in hashtableScan /home/gusakovy/Projects/valkey/src/hashtable.c:1729
    valkey-io#15 0x47e3ca in kvstoreScan /home/gusakovy/Projects/valkey/src/kvstore.c:402
    valkey-io#16 0x6d9040 in activeExpireCycle /home/gusakovy/Projects/valkey/src/expire.c:297
    valkey-io#17 0x4859d2 in databasesCron /home/gusakovy/Projects/valkey/src/server.c:1269
    valkey-io#18 0x486e92 in serverCron /home/gusakovy/Projects/valkey/src/server.c:1577
    valkey-io#19 0x4637dd in processTimeEvents /home/gusakovy/Projects/valkey/src/ae.c:370
    valkey-io#20 0x4643e3 in aeProcessEvents /home/gusakovy/Projects/valkey/src/ae.c:513
    valkey-io#21 0x4647ea in aeMain /home/gusakovy/Projects/valkey/src/ae.c:543
    valkey-io#22 0x4a61fc in main /home/gusakovy/Projects/valkey/src/server.c:7291
    valkey-io#23 0x7f471957c139 in __libc_start_main (/lib64/libc.so.6+0x21139)

SUMMARY: AddressSanitizer: heap-use-after-free /home/gusakovy/Projects/valkey/src/hashtable.c:422 in isPositionFilled
Shadow bytes around the buggy address:
  0x0c227fff85d0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff85e0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff85f0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c227fff8600: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8610: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
=>0x0c227fff8620:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8630: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8640: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8650: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8660: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8670: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==46774==ABORTING
```


**Solution**
Suggested solution is to also pause auto shrinking during
`hashtableScanDefrag`. I noticed that there was already a
`hashtablePauseAutoShrink` method and `pause_auto_shrink` counter, but
it wasn't actually used in `hashtableShrinkIfNeeded` so I fixed that.

**Testing**
I created a simple tcl test that (most of the times) triggers this
error, but it's a little clunky so I didn't add it as part of the PR:

```
start_server {tags {"expire hashtable defrag"}} {
    test {hashtable scan defrag on expiry} {

        r config set hz 100

        set num_keys 20
        for {set i 0} {$i < $num_keys} {incr i} {
            r set "key_$i" "value_$i"
        }

        for {set j 0} {$j < 50} {incr j} {
            set expire_keys 100
            for {set i 0} {$i < $expire_keys} {incr i} {
                # Short expiry time to ensure they expire quickly
                r psetex "expire_key_${i}_${j}" 100 "expire_value_${i}_${j}"
            }

            # Verify keys are set
            set initial_size [r dbsize]
            assert_equal $initial_size [expr $num_keys + $expire_keys]
            
            after 150
            for {set i 0} {$i < 10} {incr i} {
                r get "expire_key_${i}_${j}"
                after 10
            }
        }

        set remaining_keys [r dbsize]
        assert_equal $remaining_keys $num_keys

        # Verify server is still responsive
        assert_equal [r ping] {PONG}
    } {}
}
```
Compiling with ASAN using `make noopt SANITIZER=address valkey-server`
and running the test causes error above. Applying the fix resolves the
issue.

Signed-off-by: Yakov Gusakov <[email protected]>
ranshid pushed a commit that referenced this pull request Oct 10, 2025
With valkey-io#1401, we introduced additional filters to CLIENT LIST/KILL
subcommand. The intended behavior was to pick the last value of the
filter. However, we introduced memory leak for all the preceding
filters.

Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
    #0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
    #1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
    #2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
    #3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
    #4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
    #5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
    #6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
    #7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
    valkey-io#8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
    valkey-io#9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
    valkey-io#10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
    valkey-io#11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
    valkey-io#12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
    valkey-io#13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
    valkey-io#14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
    valkey-io#15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
    valkey-io#16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```

Note: For filter ID / NOT-ID we group all the option and perform
filtering whereas for remaining filters we only pick the last filter
option.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
ranshid pushed a commit that referenced this pull request Nov 26, 2025
…y-io#2608)

Introduce a new config `hash-seed` which can be set only at startup and
controls the hash seed for the server. This includes all hash tables.
This change makes it so that both primaries and replicas will return the
same results for SCAN/HSCAN/ZSCAN/SSCAN cursors. This is useful in order
to make sure SCAN behaves correctly after a failover.

Resolves #4

---------

Signed-off-by: Sarthak Aggarwal <[email protected]>
Signed-off-by: Sarthak Aggarwal <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
ranshid pushed a commit that referenced this pull request Nov 26, 2025
With valkey-io#1401, we introduced additional filters to CLIENT LIST/KILL
subcommand. The intended behavior was to pick the last value of the
filter. However, we introduced memory leak for all the preceding
filters.

Before this change:
```
> CLIENT LIST IP 127.0.0.1 IP 127.0.0.1
id=4 addr=127.0.0.1:37866 laddr=127.0.0.1:6379 fd=10 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=21 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=16989 events=r cmd=client|list user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=49 tot-net-out=0 tot-cmds=0
```
Leak:
```
Direct leak of 11 byte(s) in 1 object(s) allocated from:
    #0 0x7f2901aa557d in malloc (/lib64/libasan.so.4+0xd857d)
    #1 0x76db76 in ztrymalloc_usable_internal /workplace/harkrisp/valkey/src/zmalloc.c:156
    #2 0x76db76 in zmalloc_usable /workplace/harkrisp/valkey/src/zmalloc.c:200
    #3 0x4c4121 in _sdsnewlen.constprop.230 /workplace/harkrisp/valkey/src/sds.c:113
    #4 0x4dc456 in parseClientFiltersOrReply.constprop.63 /workplace/harkrisp/valkey/src/networking.c:4264
    #5 0x4bb9f7 in clientListCommand /workplace/harkrisp/valkey/src/networking.c:4600
    #6 0x641159 in call /workplace/harkrisp/valkey/src/server.c:3772
    #7 0x6431a6 in processCommand /workplace/harkrisp/valkey/src/server.c:4434
    valkey-io#8 0x4bfa9b in processCommandAndResetClient /workplace/harkrisp/valkey/src/networking.c:3571
    valkey-io#9 0x4bfa9b in processInputBuffer /workplace/harkrisp/valkey/src/networking.c:3702
    valkey-io#10 0x4bffa3 in readQueryFromClient /workplace/harkrisp/valkey/src/networking.c:3812
    valkey-io#11 0x481015 in callHandler /workplace/harkrisp/valkey/src/connhelpers.h:79
    valkey-io#12 0x481015 in connSocketEventHandler.lto_priv.394 /workplace/harkrisp/valkey/src/socket.c:301
    valkey-io#13 0x7d3fb3 in aeProcessEvents /workplace/harkrisp/valkey/src/ae.c:486
    valkey-io#14 0x7d4d44 in aeMain /workplace/harkrisp/valkey/src/ae.c:543
    valkey-io#15 0x453925 in main /workplace/harkrisp/valkey/src/server.c:7319
    valkey-io#16 0x7f2900cd7139 in __libc_start_main (/lib64/libc.so.6+0x21139)
```

Note: For filter ID / NOT-ID we group all the option and perform
filtering whereas for remaining filters we only pick the last filter
option.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
(cherry picked from commit 155b0bb)
Signed-off-by: cherukum-amazon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants