Currently, the implementation is limited to fixed-size segments, each small enough to be processed by a single worker (e.g., one thread block or one warp). We want to extend support to variable-size segments, where the user guarantees an upper bound on the segment size such that even the largest segment still fits within the resources of a single worker (e.g., the shared memory of a thread block for that tile size).
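To make the intended semantics concrete, here is a minimal host-side reference model in Python, not the actual device implementation. It assumes segments are described by an offsets array (segment `i` spans `offsets[i]` to `offsets[i + 1]`), uses a sum as a stand-in for the real per-segment operation, and treats each loop iteration as one worker. The names `segmented_sum`, `offsets`, and `max_segment_size` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def segmented_sum(values: Sequence[float],
                  offsets: Sequence[int],
                  max_segment_size: int) -> List[float]:
    """Reference model: one 'worker' per segment, each with a fixed-capacity tile."""
    results = []
    for seg in range(len(offsets) - 1):  # each iteration stands in for one worker
        begin, end = offsets[seg], offsets[seg + 1]
        size = end - begin
        # The user-guaranteed bound is what allows the tile to be sized up front.
        if size < 0 or size > max_segment_size:
            raise ValueError(
                f"segment {seg} has size {size}, bound is {max_segment_size}"
            )
        # Fixed-capacity scratch buffer, standing in for per-worker shared memory.
        tile = [0] * max_segment_size
        tile[:size] = values[begin:end]   # load only the valid items
        results.append(sum(tile[:size]))  # reduce only the valid items
    return results


# Three segments of sizes 1, 3, and 2, with a guaranteed bound of 3.
print(segmented_sum([1, 2, 3, 4, 5, 6], [0, 1, 4, 6], 3))  # prints [1, 9, 11]
```

The point of the sketch is that the tile capacity depends only on the guaranteed bound, not on any individual segment size, so resources can be fixed per worker while the number of valid items varies. A real kernel would presumably also need to mask or pad the unused tile slots; how that is done is left open here.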