This is the code repository of paper "LightDSA: Enabling Efficient DSA Through Hardware-Aware Transparent Optimization"

izumihanako/LightDSA

LightDSA: Enabling Efficient DSA Through Hardware-Aware Transparent Optimization

LightDSA is a user-friendly, asynchronous DSA library featuring low-overhead, run-to-completion optimizations. It is designed to eliminate the negative performance impact of DSA features, enabling DSA to run efficiently under any workload.

More information about LightDSA can be found in our EuroSys '26 paper.

Please cite our paper if you find this project useful:

@inproceedings {lightdsa-eurosys,
  author       = {Yuansen Wang and Teng Ma and Yuanhui Luo and Dongbiao He and Zheng Liu and Yunpeng Chai},
  title        = {LightDSA: Enabling Efficient DSA Through Hardware-Aware Transparent Optimization},
  booktitle    = {EuroSys},
  year         = {2026},
  publisher    = {ACM},
}

Overview

The key directories of this project and their contents are as follows:

LightDSA 
├── AE              # Scripts for artifact evaluation of the paper
│   ├── figure1         # Scripts for reproducing Figure 1
│   ├── figure3         # Scripts for reproducing Figure 3
│   ├── ...
│   └── ATCexplore      # Scripts for reproducing the ATC structure exploration
├── example         # Simple example programs using LightDSA
├── expr            # Source code for the paper's experiments
├── Makefile        # Builds both static and shared libraries
├── scripts         # Utility scripts
├── src             # LightDSA source code
├── build.sh        # Script for building LightDSA
└── prerequisite.sh # Script for installing dependencies

Dependencies

To reproduce the experiments on a custom machine, ensure the following requirements are met:

Hardware Dependencies

  • Intel Xeon CPU, 4th gen or higher (only these CPUs integrate the DSA accelerator).

Software Dependencies

  • Linux Kernel version ≥ 5.19 (Ubuntu 20.04.6 LTS with Linux 6.6.58 on the provided server)

  • Python 3.10.12 (most of the Python libraries are needed only by the AE scripts)

    • Required Python libraries: brokenaxes, datasets, huggingface_hub, matplotlib, numpy, pandas, redis, tqdm
  • idxd-config from Intel repository

  • libnuma, libpmem (available from package manager)

  • CMake version ≥ 3.16 (CMake 3.16.3 on the provided server)

  • Any C++ compiler supporting C++14 (g++ 11.4.0 on the provided server)

Build & Reproduce

First, install the required dependencies by running:

./prerequisite.sh

This script installs python3-pip, libnuma, libpmem, and CMake via the package manager, installs the required Python libraries via pip, and clones and installs idxd-config from Intel's official repository.

To build LightDSA, run:

./build.sh

We also provide an end-to-end script to reproduce all experiments:

cd AE && ./reproduce.sh 

The figures for each experiment will be generated in their corresponding figureX directories (X = 1, 3, 4, 5, 6, 7, 8, 9, 12, 13, 15). If you prefer to run a specific experiment, each directory contains a runner.sh script to reproduce the experiment and generate the figure. For example, to reproduce Figure 1:

cd AE/figure1 && ./runner.sh

This will generate figure1.pdf in the same directory, corresponding to Figure 1 in the paper.

For more details on reproduction, see the README.md in the LightDSA-AE repository.

As a Library

After building, a shared library liblightDSA.so will be available in the build directory.

If you prefer to use Make directly, a Makefile is provided in the project root directory. Simply run:

make

and you will find both liblightDSA.so and liblightDSA.a in the lib directory.

Configuration of LightDSA

The configuration of LightDSA is located in the header file src/details/dsa_conf.hpp. All configurable options are defined as macros or constants, with names and comments consistent with those in the paper. You can disable an optimization or option by commenting out the corresponding line. For example, consider the following two lines in the configuration:

#define DESCS_OUT_OF_ORDER_RECYCLE_ENABLE        /*** use Out-of-Order recycle ***/  
constexpr int OUT_OF_ORDER_RECYCLE_T_INIT = 25 ; /*** T_init value ***/

If the first line is commented out, the out-of-order recycle optimization is disabled. The second line sets $t_{init}$ to 25; this definition has no effect when the first line is commented out. For details on out-of-order recycle and the meaning of $t_{init}$, please refer to Section 4.5 of the LightDSA paper.
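As a rough illustration of how such a macro can gate an optimization, the sketch below reuses the two configuration lines and adds two hypothetical helpers, recycle_mode() and recycle_threshold(). These helpers and the "in-order" fallback label are assumptions made for the example; they are not taken from the LightDSA sources.

```cpp
// Illustrative sketch only: recycle_mode() and recycle_threshold() are
// hypothetical helpers and are not part of LightDSA.
#include <cassert>
#include <string>

#define DESCS_OUT_OF_ORDER_RECYCLE_ENABLE        /*** use Out-of-Order recycle ***/
constexpr int OUT_OF_ORDER_RECYCLE_T_INIT = 25 ; /*** T_init value ***/

// Reports which code path the preprocessor selected.
inline std::string recycle_mode() {
#ifdef DESCS_OUT_OF_ORDER_RECYCLE_ENABLE
    return "out-of-order";
#else
    return "in-order";  // assumed name for the fallback path
#endif
}

// T_init is read only on the out-of-order path. If the macro is commented
// out, the constant still compiles, but nothing consults it.
inline int recycle_threshold() {
#ifdef DESCS_OUT_OF_ORDER_RECYCLE_ENABLE
    return OUT_OF_ORDER_RECYCLE_T_INIT;
#else
    return 0;  // placeholder: no threshold is used on the fallback path
#endif
}
```

Commenting out the #define line and recompiling switches both helpers to their fallback branches, which mirrors why the T_init definition has no effect in that case.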

Why Determine All Configurations at Compile Time?

LightDSA determines all configurations at compile time rather than at run time. This maximizes performance and simplifies the API. Each descriptor contains a flags field whose value depends on the configuration. If configurations were determined at run time, the API would need extra knobs, and every descriptor initialization would have to compute the flags value. This overhead is especially noticeable for short-running operations. By computing the flags value at compile time, LightDSA reduces the initialization of the flags field to a single constant assignment, which minimizes overhead and simplifies the API.
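The trade-off can be sketched in a few lines. Everything below is hypothetical: the SKETCH_OPT_* macros, the FLAG_* bit positions, SketchDescriptor, and RuntimeConfig are stand-ins chosen for the example and do not reflect the real descriptor layout or LightDSA's option names.

```cpp
// Illustrative sketch only: all names and bit values here are hypothetical.
#include <cassert>
#include <cstdint>

#define SKETCH_OPT_CACHE_CONTROL
// #define SKETCH_OPT_BLOCK_ON_FAULT   // commented out, so this option is off

constexpr std::uint32_t FLAG_REQUEST_COMPLETION = 1u << 0;
constexpr std::uint32_t FLAG_CACHE_CONTROL      = 1u << 1;
constexpr std::uint32_t FLAG_BLOCK_ON_FAULT     = 1u << 2;

// Evaluated by the compiler: the result is a plain constant in the binary.
constexpr std::uint32_t compute_flags() {
    std::uint32_t f = FLAG_REQUEST_COMPLETION;
#ifdef SKETCH_OPT_CACHE_CONTROL
    f |= FLAG_CACHE_CONTROL;
#endif
#ifdef SKETCH_OPT_BLOCK_ON_FAULT
    f |= FLAG_BLOCK_ON_FAULT;
#endif
    return f;
}
constexpr std::uint32_t DEFAULT_FLAGS = compute_flags();
static_assert(DEFAULT_FLAGS == (FLAG_REQUEST_COMPLETION | FLAG_CACHE_CONTROL),
              "flags are fixed at compile time");

struct SketchDescriptor {
    std::uint32_t flags;
};

// Compile-time configuration: initialization is one constant store.
inline void init_descriptor(SketchDescriptor &d) { d.flags = DEFAULT_FLAGS; }

// Run-time alternative, for contrast: extra knobs in the interface and
// branches on every descriptor initialization.
struct RuntimeConfig {
    bool cache_control;
    bool block_on_fault;
};
inline std::uint32_t flags_at_runtime(const RuntimeConfig &cfg) {
    std::uint32_t f = FLAG_REQUEST_COMPLETION;
    if (cfg.cache_control)  f |= FLAG_CACHE_CONTROL;
    if (cfg.block_on_fault) f |= FLAG_BLOCK_ON_FAULT;
    return f;
}
```

Both paths yield the same value for the same settings; the difference is that init_descriptor() needs no configuration argument and performs no branching per descriptor.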

API and Code Examples

We provide "Hello World" examples in the example directory for both C (example_c.c) and C++ (example_cpp.cpp), each with detailed comments.

To run these examples, first build LightDSA:

./build.sh

Next, go to the build folder and manually set up DSA:

cd build
sudo ./setup_dsa.sh -d dsa0 -w 1 -m s -e 1 -f 1

Then, you can run the Hello World example:

sudo ./example/example_c # or sudo ./example/example_cpp

The output includes the DSA configuration on the machine and the status of the various LightDSA optimizations.

All APIs are declared in the header files under src/interfaces. To use them, simply include src/lightdsa.hpp (C++) or src/lightdsa_c.h (C). APIs are organized into four categories based on (1) synchronicity (synchronous vs. asynchronous) and (2) submission mode (batch vs. single operation).

Batch Submission

Batch submission is always asynchronous. See dsa_batch.hpp for the detailed interfaces.

Before submitting operations, create a DSAbatch object, e.g., DSAbatch batch.

| Operation type | Interface |
| --- | --- |
| memmove (async) | batch.submit_memmove(void *dest, void *src, size_t len) |
| memfill (async) | batch.submit_memfill(void *dest, uint64_t pattern, size_t len) |
| flush (async) | batch.submit_flush(void *dest, size_t len) |
| noop (async) | batch.submit_noop() |
| check | batch.check() |
| wait | batch.wait() |

  • batch.submit_memmove copies len bytes from src to dest.
  • batch.submit_memfill fills len bytes at dest by repeating the 8-byte pattern.
  • batch.submit_flush flushes len bytes starting at dest from CPU caches back to memory.
  • batch.submit_noop is a no-op that does nothing.
  • batch.check returns 1 if all submitted operations have completed, otherwise 0.
  • batch.wait blocks until all submitted operations complete.

All operations execute asynchronously and may not start immediately upon submission. Use batch.wait() to wait for completion. Execution order is not guaranteed to follow submission order; if strict ordering is required, call batch.wait() as needed (note this can significantly reduce performance).
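The submit, check, and wait pattern described above can be sketched without DSA hardware. MockBatch below is a CPU-only stand-in written for this example; it is not LightDSA's DSAbatch. It borrows the documented method names, defers all work until wait() to mimic asynchronous execution, and uses void return types for the submit methods as an assumption. With the real library one would presumably include src/lightdsa.hpp and declare a DSAbatch instead.

```cpp
// Illustrative sketch only: MockBatch is a CPU-only stand-in, not DSAbatch.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

class MockBatch {
public:
    void submit_memmove(void *dest, void *src, std::size_t len) {
        pending_.push_back([=] { std::memmove(dest, src, len); });
    }
    void submit_memfill(void *dest, std::uint64_t pattern, std::size_t len) {
        pending_.push_back([=] {
            auto *out = static_cast<unsigned char *>(dest);
            // Repeat the 8-byte pattern (host byte order) across len bytes.
            for (std::size_t off = 0; off < len; off += sizeof pattern) {
                std::size_t n = std::min(sizeof pattern, len - off);
                std::memcpy(out + off, &pattern, n);
            }
        });
    }
    // 1 if nothing is outstanding, otherwise 0.
    int check() const { return pending_.empty() ? 1 : 0; }
    // Runs everything that was submitted, then returns.
    void wait() {
        for (auto &op : pending_) op();
        pending_.clear();
    }
private:
    std::vector<std::function<void()>> pending_;
};

inline bool batch_demo() {
    const std::uint64_t pattern = 0x0123456789ABCDEFull;
    std::vector<char> a(4096, 'A'), b(4096, 0), c(4096, 0);
    std::vector<std::uint64_t> filled(512, 0);

    MockBatch batch;
    // Two independent operations: no ordering between them is needed.
    batch.submit_memmove(b.data(), a.data(), a.size());
    batch.submit_memfill(filled.data(), pattern,
                         filled.size() * sizeof(std::uint64_t));
    bool was_pending = (batch.check() == 0);  // results must not be read yet

    batch.wait();  // b and filled are valid only after this returns
    bool first_ok = (b == a) && filled.front() == pattern &&
                    filled.back() == pattern;

    // Dependent operation: b is now a source, so it is submitted only after
    // the wait() above has guaranteed that b is complete.
    batch.submit_memmove(c.data(), b.data(), b.size());
    batch.wait();
    return was_pending && first_ok && batch.check() == 1 && c == a;
}
```

Grouping independent operations before a single wait(), and inserting a wait() only where one operation consumes another's output, keeps the number of synchronization points low.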

Note: Batch APIs for compare and comp_pattern are not yet provided. Because the Intra-Batch Descriptors Mixing policy may reorder operations within a batch, retrieving comparison results requires an intermediate layer to track submission order. We plan to add this functionality in a future update.

Single-Operation Submission

See dsa_op.hpp for the detailed interfaces.

Before submitting operations, create a DSAop object, e.g., DSAop dsaop.

| Operation type | Interface |
| --- | --- |
| memmove | dsaop.sync_memmove(void *dest, void *src, size_t len) |
| memmove (async) | dsaop.async_memmove(void *dest, void *src, size_t len) |
| memfill | dsaop.sync_memfill(void *dest, uint64_t pattern, size_t len) |
| memfill (async) | dsaop.async_memfill(void *dest, uint64_t pattern, size_t len) |
| compare | dsaop.sync_compare(void *dest, void *src, size_t len) |
| compare (async) | dsaop.async_compare(void *dest, void *src, size_t len) |
| comp_pattern | dsaop.sync_comp_pattern(void *src, uint64_t pattern, size_t len) |
| comp_pattern (async) | dsaop.async_comp_pattern(void *src, uint64_t pattern, size_t len) |
| flush | dsaop.sync_flush(void *dest, size_t len) |
| flush (async) | dsaop.async_flush(void *dest, size_t len) |
| wait | dsaop.wait() |
| check | dsaop.check() |
| compare_res | dsaop.compare_res() |
| compare_offset | dsaop.compare_differ_offset() |

  • Operations that also exist in the batch API have the same semantics here. Methods prefixed with sync_ are synchronous (equivalent to calling the corresponding async_ method followed by wait()).
  • dsaop.sync_compare compares len bytes at dest and src for equality.
  • dsaop.sync_comp_pattern checks whether len bytes at src match the 8-byte pattern.
  • dsaop.compare_res returns the comparison result: 0 if the regions match, 1 if they differ.
  • dsaop.compare_differ_offset returns the byte offset of the first difference; if the regions match, it returns len.
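The result conventions for the compare operations can be illustrated with a CPU-only stand-in. MockOp below is written for this example and is not LightDSA's DSAop: it reuses the documented method names, but its return types and byte-by-byte implementation are assumptions. It follows the semantics listed above: compare_res() is 0 on a match and 1 on a difference, and compare_differ_offset() is the first differing byte offset, or len when the regions match.

```cpp
// Illustrative sketch only: MockOp is a CPU-only stand-in, not DSAop.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

class MockOp {
public:
    // Compares len bytes at dest and src.
    void sync_compare(void *dest, void *src, std::size_t len) {
        auto *a = static_cast<const unsigned char *>(dest);
        auto *b = static_cast<const unsigned char *>(src);
        std::size_t i = 0;
        while (i < len && a[i] == b[i]) ++i;
        record(i, len);
    }
    // Checks len bytes at src against the repeated 8-byte pattern.
    void sync_comp_pattern(void *src, std::uint64_t pattern, std::size_t len) {
        auto *a = static_cast<const unsigned char *>(src);
        unsigned char pat[sizeof pattern];
        std::memcpy(pat, &pattern, sizeof pattern);  // host byte order
        std::size_t i = 0;
        while (i < len && a[i] == pat[i % sizeof pattern]) ++i;
        record(i, len);
    }
    int compare_res() const { return differ_ ? 1 : 0; }
    std::size_t compare_differ_offset() const { return offset_; }
private:
    void record(std::size_t first_diff, std::size_t len) {
        differ_ = first_diff < len;
        offset_ = first_diff;  // equals len when no byte differs
    }
    bool differ_ = false;
    std::size_t offset_ = 0;
};

inline bool compare_demo() {
    std::vector<unsigned char> x(256, 0x5A), y(256, 0x5A);
    MockOp dsaop;

    dsaop.sync_compare(x.data(), y.data(), x.size());
    bool equal_ok = dsaop.compare_res() == 0 &&
                    dsaop.compare_differ_offset() == x.size();

    y[100] = 0x00;  // introduce a single differing byte
    dsaop.sync_compare(x.data(), y.data(), x.size());
    bool differ_ok = dsaop.compare_res() == 1 &&
                     dsaop.compare_differ_offset() == 100;

    // x still consists of repeated 0x5A bytes, so it matches this pattern.
    dsaop.sync_comp_pattern(x.data(), 0x5A5A5A5A5A5A5A5Aull, x.size());
    bool pattern_ok = dsaop.compare_res() == 0;

    return equal_ok && differ_ok && pattern_ok;
}
```

Since a matching comparison reports len as the offset, callers should test compare_res() first rather than rely on the offset alone.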

Reference

We use some scripts from Intel's dsa-perf-micros repository.
