LightDSA is a user-friendly, asynchronous DSA library featuring low-overhead, run-to-completion optimizations. It is designed to eliminate the negative performance impact of DSA features, enabling DSA to run efficiently under any workload.
More information about LightDSA can be found in our EuroSys '26 paper.
Please cite our paper if you find this project useful:
@inproceedings {lightdsa-eurosys,
author = {Yuansen Wang and Teng Ma and Yuanhui Luo and Dongbiao He and Zheng Liu and Yunpeng Chai},
title = {LightDSA: Enabling Efficient DSA Through Hardware-Aware Transparent Optimization},
booktitle = {EuroSys},
year = {2026},
publisher = {ACM},
}
The key directories of this project and their contents are as follows:
LightDSA
├── AE # Scripts for artifact evaluation of the paper
│ ├── figure1 # Scripts to reproduce Figure 1
│ ├── figure3 # Scripts to reproduce Figure 3
│ ├── ...
│ └── ATCexplore # Scripts to reproduce the ATC structure exploration
├── example # Simple example programs using LightDSA
├── expr # Source code for the paper's experiments
├── Makefile # Builds both static and shared libraries
├── scripts # Utility scripts
├── src # LightDSA source code
├── build.sh # Script for building LightDSA
└── prerequisite.sh # Script for installing dependencies

To reproduce the experiments on a custom machine, ensure the following requirements are met:
- Intel Xeon CPU, 4th generation or newer (only these CPUs integrate the DSA accelerator)
- Linux kernel version ≥ 5.19 (Ubuntu 20.04.6 LTS with Linux 6.6.58 on the provided server)
- Python 3.10.12 (most of the Python libraries are for the AE scripts)
  - Required Python libraries: brokenaxes, datasets, huggingface_hub, matplotlib, numpy, pandas, redis, tqdm
- idxd-config from Intel's repository
- libnuma, libpmem (available from the package manager)
- CMake version ≥ 3.16 (CMake 3.16.3 on the provided server)
- Any C++ compiler supporting C++14 (g++ 11.4.0 on the provided server)
First, install the required dependencies by running:
./prerequisite.sh

This script installs python3-pip, libnuma, libpmem, and CMake via the package manager, installs the required Python libraries via pip, and clones and installs idxd-config from Intel's official repository.
To build LightDSA, run:
./build.sh

We also provide an end-to-end script to reproduce all experiments:
cd AE && ./reproduce.sh

The figures for each experiment will be generated in their corresponding figureX directories (X = 1, 3, 4, 5, 6, 7, 8, 9, 12, 13, 15).
If you prefer to run a specific experiment, each directory contains a runner.sh script to reproduce the experiment and generate the figure. For example, to reproduce Figure 1:
cd AE/figure1 && ./runner.sh

This will generate figure1.pdf in the same directory, corresponding to Figure 1 in the paper.
For more details on reproduction, see the README.md in the LightDSA-AE repository.
After building, a shared library liblightDSA.so will be available in the build directory.
Alternatively, you can build with the Makefile in the project root directory. Simply run:
make

You will then find both liblightDSA.so and liblightDSA.a in the lib directory.
The configuration of LightDSA is located in the header file src/details/dsa_conf.hpp. All configurable options are defined as macros, with names and comments consistent with those in the paper. You can disable an optimization or option by commenting out the corresponding line. For example, consider the following two lines in the configuration:
#define DESCS_OUT_OF_ORDER_RECYCLE_ENABLE /*** use Out-of-Order recycle ***/
constexpr int OUT_OF_ORDER_RECYCLE_T_INIT = 25 ; /*** T_init value ***/

If the first line is commented out, the out-of-order recycle optimization is disabled.
The second line sets the T_init value to 25.
LightDSA determines all configurations at compile time rather than at run time. This maximizes performance and simplifies the API. Each descriptor contains a flags field whose value depends on the configuration. If configurations were determined at run time, the API would need extra knobs, and every descriptor initialization would have to compute the flags value again. This overhead is especially noticeable for short-running operations. By computing the flags values at compile time, LightDSA reduces the initialization of the flags field to a single constant assignment.
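The idea can be illustrated with a small self-contained sketch. This is not LightDSA's actual code: the macros `EXAMPLE_OPTION_A_ENABLE` and `EXAMPLE_OPTION_B_ENABLE`, the flag bit values, and the `ExampleDescriptor` type are hypothetical names invented for illustration. The sketch only shows how macro-controlled options can be folded into one `constexpr` value, so that descriptor initialization becomes a constant store.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical option macros, standing in for those in dsa_conf.hpp.
// Commenting out a line disables that option at compile time.
#define EXAMPLE_OPTION_A_ENABLE
// #define EXAMPLE_OPTION_B_ENABLE

// Hypothetical flag bits (assumed values, not the real DSA flag layout).
constexpr std::uint32_t EXAMPLE_FLAG_A = 1u << 0;
constexpr std::uint32_t EXAMPLE_FLAG_B = 1u << 1;

// Combines the enabled options into one value; evaluated by the compiler.
constexpr std::uint32_t example_flags() {
    std::uint32_t flags = 0;
#ifdef EXAMPLE_OPTION_A_ENABLE
    flags |= EXAMPLE_FLAG_A;
#endif
#ifdef EXAMPLE_OPTION_B_ENABLE
    flags |= EXAMPLE_FLAG_B;
#endif
    return flags;
}

// The final flags value is a compile-time constant.
constexpr std::uint32_t EXAMPLE_DESC_FLAGS = example_flags();

// Hypothetical, heavily simplified descriptor.
struct ExampleDescriptor {
    std::uint32_t flags;
    std::uint32_t len;
};

// Initialization needs no branches on configuration: flags is a constant store.
inline void init_descriptor(ExampleDescriptor &d, std::uint32_t len) {
    d.flags = EXAMPLE_DESC_FLAGS;
    d.len = len;
}
```

With a run-time configuration, `init_descriptor` would instead have to test each option on every call, which is the per-descriptor cost the paragraph above describes.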
We provide "Hello World" examples in the example directory for both C (example_c.c) and C++ (example_cpp.cpp), each with detailed comments.
To run these examples, first build LightDSA:
./build.sh

Next, go to the build folder and manually set up DSA:
cd build
sudo ./setup_dsa.sh -d dsa0 -w 1 -m s -e 1 -f 1

Then, you can run the Hello World example:
sudo ./example/example_c # or sudo ./example/example_cpp

You will see output that includes the DSA configuration on the machine and the status of the various LightDSA optimizations.
All APIs are declared in the header files under src/interfaces. To use them, simply include src/lightdsa.hpp (C++) or src/lightdsa_c.h (C). APIs are organized into four categories based on (1) synchronicity (synchronous vs. asynchronous) and (2) submission mode (batch vs. single operation).
Batch submission is always asynchronous. See dsa_batch.hpp for the detailed interfaces.
Before submitting operations, create a DSAbatch object, e.g., DSAbatch batch.
| Operation type | Interface |
|---|---|
| memmove (async) | batch.submit_memmove(void *dest, void *src, size_t len) |
| memfill (async) | batch.submit_memfill(void *dest, uint64_t pattern, size_t len) |
| flush (async) | batch.submit_flush(void *dest, size_t len) |
| noop (async) | batch.submit_noop() |
| check | batch.check() |
| wait | batch.wait() |
- batch.submit_memmove copies len bytes from src to dest.
- batch.submit_memfill fills len bytes at dest by repeating the 8-byte pattern.
- batch.submit_flush flushes len bytes starting at dest from CPU caches back to memory.
- batch.submit_noop is a no-op that does nothing.
- batch.check returns 1 if all submitted operations have completed, otherwise 0.
- batch.wait blocks until all submitted operations complete.
All operations execute asynchronously and may not start immediately upon submission. Use batch.wait() to wait for completion. Execution order is not guaranteed to follow submission order; if strict ordering is required, call batch.wait() as needed (note this can significantly reduce performance).
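The submit/wait pattern for dependent operations can be sketched as follows. To keep the sketch runnable without DSA hardware, it uses `MockBatch`, a hypothetical CPU-only stand-in written for this illustration; it is not the real DSAbatch class and only mimics the method names from the table above. The fill byte order (the pattern's in-memory representation) is also an assumption. See example/example_cpp.cpp for actual library usage.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical stand-in for DSAbatch: queues work and runs it on the CPU
// only when wait() is called, to mimic deferred, asynchronous execution.
class MockBatch {
public:
    void submit_memmove(void *dest, void *src, std::size_t len) {
        pending_.push_back([=] { std::memmove(dest, src, len); });
    }
    void submit_memfill(void *dest, std::uint64_t pattern, std::size_t len) {
        pending_.push_back([=] {
            unsigned char *out = static_cast<unsigned char *>(dest);
            // Repeat the 8-byte pattern; the last chunk may be shorter.
            for (std::size_t i = 0; i < len; i += 8)
                std::memcpy(out + i, &pattern, std::min<std::size_t>(8, len - i));
        });
    }
    // 1 if nothing is pending, otherwise 0 (the mock makes no progress here).
    int check() const { return pending_.empty() ? 1 : 0; }
    // Runs all queued operations, then returns.
    void wait() {
        for (auto &op : pending_) op();
        pending_.clear();
    }
private:
    std::vector<std::function<void()>> pending_;
};

// Two dependent copies (src -> mid -> dst) plus one independent fill.
inline bool example_dependent_copies() {
    char src[64], mid[64], dst[64], other[64];
    std::memset(src, 'A', sizeof(src));
    std::memset(mid, 0, sizeof(mid));
    std::memset(dst, 0, sizeof(dst));
    std::memset(other, 0, sizeof(other));

    MockBatch batch;
    batch.submit_memmove(mid, src, sizeof(src));
    batch.submit_memfill(other, 0x0101010101010101ULL, sizeof(other)); // independent
    batch.wait(); // mid must be complete before it is read as a source
    batch.submit_memmove(dst, mid, sizeof(mid));
    batch.wait();
    return std::memcmp(dst, src, sizeof(src)) == 0 && other[63] == 1;
}
```

Independent operations are submitted together and share one wait(), while wait() is inserted only where a later operation reads the result of an earlier one, which limits the performance cost noted above.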
Note: Batch APIs for compare and comp_pattern are not yet provided. Because the Intra-Batch Descriptors Mixing policy may reorder operations within a batch, retrieving comparison results requires an intermediate layer to track submission order. We plan to add this functionality in a future update.
See dsa_op.hpp for the detailed interfaces.
Before submitting operations, create a DSAop object, e.g., DSAop dsaop.
| Operation type | Interface |
|---|---|
| memmove | dsaop.sync_memmove(void *dest, void *src, size_t len) |
| memmove (async) | dsaop.async_memmove(void *dest, void *src, size_t len) |
| memfill | dsaop.sync_memfill(void *dest, uint64_t pattern, size_t len) |
| memfill (async) | dsaop.async_memfill(void *dest, uint64_t pattern, size_t len) |
| compare | dsaop.sync_compare(void *dest, void *src, size_t len) |
| compare (async) | dsaop.async_compare(void *dest, void *src, size_t len) |
| comp_pattern | dsaop.sync_comp_pattern(void *src, uint64_t pattern, size_t len) |
| comp_pattern (async) | dsaop.async_comp_pattern(void *src, uint64_t pattern, size_t len) |
| flush | dsaop.sync_flush(void *dest, size_t len) |
| flush (async) | dsaop.async_flush(void *dest, size_t len) |
| wait | dsaop.wait() |
| check | dsaop.check() |
| compare_res | dsaop.compare_res() |
| compare_offset | dsaop.compare_differ_offset() |
- Operations that also exist in the batch API have the same semantics here.
- Methods prefixed with sync_ are synchronous (equivalent to calling the corresponding async_ method followed by wait()).
- dsaop.sync_compare compares len bytes at dest and src for equality.
- dsaop.sync_comp_pattern checks whether len bytes at src match the 8-byte pattern.
- dsaop.compare_res returns the comparison result: 0 if the regions match, 1 if they differ.
- dsaop.compare_differ_offset returns the byte offset of the first difference; if the regions match, it returns len.
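The result conventions described above can be made concrete with a CPU-only reference sketch. It does not call LightDSA or DSA hardware; `CompareOutcome`, `reference_compare`, and `reference_comp_pattern` are hypothetical helpers that mirror only the semantics stated in this README. The pattern byte order (in-memory representation of the 64-bit value) and the handling of a len that is not a multiple of 8 are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical result type mirroring compare_res() and compare_differ_offset().
struct CompareOutcome {
    int res;            // 0 if the regions match, 1 if they differ
    std::size_t offset; // offset of the first difference, or len on a match
};

// Reference semantics for compare: byte-wise equality of two regions.
inline CompareOutcome reference_compare(const void *dest, const void *src,
                                        std::size_t len) {
    const unsigned char *a = static_cast<const unsigned char *>(dest);
    const unsigned char *b = static_cast<const unsigned char *>(src);
    for (std::size_t i = 0; i < len; ++i)
        if (a[i] != b[i]) return {1, i};
    return {0, len};
}

// Reference semantics for comp_pattern: 0 if every byte of the region
// matches the repeated 8-byte pattern, 1 otherwise.
inline int reference_comp_pattern(const void *src, std::uint64_t pattern,
                                  std::size_t len) {
    unsigned char pat[8];
    std::memcpy(pat, &pattern, sizeof(pat)); // assumed in-memory byte order
    const unsigned char *a = static_cast<const unsigned char *>(src);
    for (std::size_t i = 0; i < len; ++i)
        if (a[i] != pat[i % 8]) return 1;
    return 0;
}
```

A helper like this could serve as an oracle when checking accelerator results in a unit test, under the assumptions listed above.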
We use some scripts from Intel's dsa-perf-micros repository.