VSort is a high-performance sorting library designed to leverage the unique architecture of Apple Silicon processors. By combining ARM NEON vector instruction detection, Grand Central Dispatch for parallelism, and adaptive algorithm selection based on hardware characteristics, VSort aims to deliver exceptional sorting performance.
Author: Davide Santangelo
- Unified runtime API — configure behaviour through `vsort_options_t`, feature flags, and new helper utilities like `vsort_sort`, `vsort_default_flags`, and `vsort_set_default_flags` (see the sketch after this list).
- Hybrid sorting core — adaptive introsort with heapsort fallback, fast insertion sort for nearly-sorted data, and optional LSD radix sort for large integer workloads.
- Stable + specialised paths — opt into stable merge sort via `VSORT_FLAG_FORCE_STABLE` and benefit from a counting-sort fast path for byte arrays.
- Optimised parallelism — redesigned Apple Silicon pipeline that chunk-sorts in parallel and performs batched merge passes using Grand Central Dispatch.
- Thread-safe initialisation — hardware calibration now happens exactly once via atomic guards, ensuring safe use from multi-threaded applications.
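A minimal usage sketch of this API is shown below. The exact fields of `vsort_options_t` and the parameter lists of `vsort_sort` and `vsort_default_flags` are assumptions made for illustration; check `vsort.h` for the real declarations.

```c
#include "vsort.h"

void sort_examples(int *data, int n) {
    /* Default path: sort with the library's auto-calibrated settings. */
    vsort(data, n);

    /* Hypothetical per-call configuration (field and parameter names are
     * illustrative assumptions, not the confirmed API): request the stable
     * merge-sort path for this one invocation. */
    vsort_options_t opts = {0};
    opts.flags = vsort_default_flags() | VSORT_FLAG_FORCE_STABLE;
    vsort_sort(data, n, &opts);
}
```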
VSort's optimizations are designed to maximize performance on Apple Silicon, with the following key aspects:
- ARM NEON SIMD Vectorization: ARM NEON provides 128-bit vector registers, enabling simultaneous processing of multiple elements. This is particularly effective when sorting large datasets, as vectorized operations cut the cost of the partitioning and comparison steps.
- Grand Central Dispatch (GCD) Integration: GCD, Apple's task-scheduling system, is used to parallelize sorting tasks across multiple cores. This is crucial for leveraging Apple Silicon's multi-core architecture, distributing work to both P-cores and E-cores (see the GCD sketch after this list).
- Performance & Efficiency Core Awareness: VSort detects and utilizes both Performance (P) and Efficiency (E) cores on Apple Silicon chips, assigning workloads appropriately for optimal speed and power efficiency. Complex, disordered chunks are processed on high-performance cores while simpler, more ordered chunks go to efficiency cores for better overall throughput and power usage.
- Cache-Optimized Memory Access: VSort uses adaptive chunk sizing based on cache characteristics, with chunk sizes tuned for the L2 cache on Apple Silicon (typically 128KB per core). This minimizes cache misses and improves throughput.
- Branch Prediction Optimization: Sorting algorithms, and quicksort in particular, suffer from branch mispredictions; VSort reduces these by restructuring branch-heavy operations.
- Adaptive Algorithm Selection: VSort selects algorithms based on array size and data patterns, using insertion sort for small arrays (threshold around 16 elements) and quicksort for larger ones, with parallel processing for very large arrays.
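To make the GCD integration concrete, here is a minimal, generic sketch of the chunk-sort stage, not VSort's internal code: `dispatch_apply` fans independent chunk sorts out across the available P- and E-cores, with `qsort` standing in for the per-chunk sorter; a real pipeline would follow this with merge passes.

```c
#include <dispatch/dispatch.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort each fixed-size chunk of the array in parallel using GCD. */
static void parallel_chunk_sort(int *data, size_t n, size_t chunk) {
    size_t chunks = (n + chunk - 1) / chunk;
    dispatch_queue_t q = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);
    dispatch_apply(chunks, q, ^(size_t i) {
        size_t lo = i * chunk;
        size_t len = (lo + chunk <= n) ? chunk : n - lo;
        qsort(data + lo, len, sizeof(int), cmp_int);  /* stand-in chunk sorter */
    });
}
```

`dispatch_apply` blocks until every chunk is sorted, so the subsequent merge passes can start immediately afterwards.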
Recent updates have further enhanced VSort's reliability and performance:
- Configurable runtime API: Every entry point now funnels through `vsort_sort`, enabling per-call tuning and default policy overrides.
- Hybrid algorithm engine: Adaptive introsort with heapsort fallback, nearly-sorted detection, and optional stable merge sort mode.
- Specialised fast paths: Counting sort accelerates byte arrays, while tuned LSD radix sort speeds up large integer ranges.
- Thread-safe initialisation: Hardware calibration leverages atomic guards, making the library safe in multi-threaded contexts.
- Parallel pipeline refresh: Apple Silicon builds now chunk-sort and merge in parallel with reduced allocation pressure.
- Hybrid Approach: Combines multiple sorting algorithms (Insertion, Quick, Radix, Parallel Merge Sort structure)
- Hardware Detection: Runtime detection of cores, cache sizes, NEON support
- Auto-Calibration: Dynamically sets internal thresholds based on detected hardware
- Parallelized Sorting (GCD): Distributes initial chunk sorting and merge passes across multiple cores for large arrays on Apple platforms
- Optimized Quicksort: Iterative implementation with median-of-three pivot (illustrated in the sketch after this list)
- Optimized Insertion Sort: Standard implementation, effective for small/nearly-sorted data
- LSD Radix Sort: Efficient implementation for large integer arrays, handles negative numbers
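For reference, the sketch below shows two of the classic ingredients named above in their textbook form: an insertion sort for small ranges and median-of-three pivot selection for the quicksort path. It is illustrative only (VSort's actual implementation is iterative and auto-calibrated); the threshold value is the approximate figure mentioned earlier.

```c
#define SMALL_ARRAY_THRESHOLD 16   /* approximate cutoff cited above */

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Insertion sort: cheap for small or nearly-sorted ranges. */
static void insertion_sort(int *a, int n) {
    for (int i = 1; i < n; i++) {
        int key = a[i], j = i - 1;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

/* Median-of-three: order a[lo], a[mid], a[hi], then use the middle value as
 * the pivot, which avoids worst-case partitions on sorted/reverse input. */
static int median_of_three(int *a, int lo, int hi) {
    int mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo]) swap_int(&a[mid], &a[lo]);
    if (a[hi]  < a[lo]) swap_int(&a[hi],  &a[lo]);
    if (a[hi]  < a[mid]) swap_int(&a[hi], &a[mid]);
    return a[mid];   /* quicksort partitions around this value; ranges at or
                        below SMALL_ARRAY_THRESHOLD go to insertion_sort */
}
```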
| Feature | Status | Notes |
|---|---|---|
| NEON Vectorization | Partial (detection only) / Planned | Header included, detection present. Merge/partition still need vectorized implementations. |
| P/E Core Detection | Implemented | Detects P/E cores via sysctl. |
| P/E Core Workload Optimization | Simplified (QoS based) / Planned | Relies on GCD QoS; complex heuristic distribution removed. |
| Dynamic Threshold Adjustment | Implemented | Auto-calibrates thresholds based on detected hardware. |
| Grand Central Dispatch | Implemented | Used for parallel sort and parallel merge dispatch. |
| Adaptive Algorithm Selection | Implemented | Switches between Insertion, Quick, Radix based on size/data. |
| Cache Optimization | Implemented (Thresholds/Chunking) | Thresholds and parallel chunk size influenced by cache info. |
| Parallel Merge | Implemented (Parallel Dispatch) / Incomplete | Merge calls are parallelized; internal merge logic is sequential. |
| Branch Prediction Optimization | Planned | To be investigated. |
- Work distribution based on chunk complexity to P-cores and E-cores
- Work-stealing queue structure for better load balancing
- Balanced binary tree approach for parallel merging
- Adaptive chunk sizing that balances cache efficiency and parallelism
- Optimized thread count allocation based on array size and core types
- Vectorized merge operations using NEON when possible
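Since vectorized merging is still on the roadmap, the snippet below only illustrates the kind of NEON primitive such a merge would build on: loading four 32-bit integers per register and computing per-lane minima and maxima in single instructions. It is a generic intrinsics example, not code from VSort.

```c
#include <arm_neon.h>

/* Per-lane min/max of two blocks of four ints: the basic building block of
 * bitonic-style vectorized merge networks. */
static inline void neon_minmax4(const int32_t *a, const int32_t *b,
                                int32_t *lo, int32_t *hi) {
    int32x4_t va = vld1q_s32(a);        /* load 4 ints from a */
    int32x4_t vb = vld1q_s32(b);        /* load 4 ints from b */
    vst1q_s32(lo, vminq_s32(va, vb));   /* element-wise minimum */
    vst1q_s32(hi, vmaxq_s32(va, vb));   /* element-wise maximum */
}
```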
The latest benchmark results on Apple Silicon (M4) show impressive performance:
| Array Size | Random (ms) | Nearly Sorted (ms) | Reverse (ms) |
|---|---|---|---|
| 1000 | 0.06 | 0.03 | 0.02 |
| 10000 | 0.72 | 0.32 | 0.26 |
| 100000 | 2.74 | 1.29 | 0.50 |
| 1000000 | 13.93 | 4.81 | 3.15 |
VSort demonstrates:
- Near-instantaneous sorting for small arrays (<10K elements)
- Excellent performance for already sorted or reverse-sorted data
- Up to 4.4x speedup for reverse-sorted data compared to random data
- Efficient scaling from small to large array sizes
Compared to standard library qsort and basic textbook implementations of quicksort/mergesort, VSort is expected to offer significant advantages on Apple Silicon due to its hardware-specific optimizations and parallelism, especially for larger datasets. However, performance relative to highly optimized standard library sorts (like C++ std::sort, which often uses Introsort) requires careful benchmarking on the target machine.
| Size | vsort (ms) | quicksort (ms) | mergesort (ms) | std::qsort (ms) |
|---|---|---|---|---|
| 10,000 | 0.33 (baseline) | 0.36 (1.09×) | 0.33 (1.00×) | 0.48 (1.46×) |
| 100,000 | 1.20 (baseline) | 4.14 (3.45×) | 3.62 (3.02×) | 5.23 (4.36×) |
| 1,000,000 | 10.09 (baseline) | 44.87 (4.45×) | 39.81 (3.95×) | 59.88 (5.94×) |
VSort provides:
- Dramatic performance improvements over traditional algorithms, especially for large datasets
- Up to 5.94× faster than standard library sorting functions
- Performance parity with mergesort for small arrays, but significantly better with larger data
- Exceptional scaling advantage as dataset size increases
| Algorithm | Avg Time (ms) | Min Time (ms) | Verification |
|---|---|---|---|
| vsort | 0.581 | 0.538 | PASSED |
| quicksort | 0.570 | 0.557 | PASSED |
| mergesort | 0.558 | 0.545 | PASSED |
| std_sort | 0.783 | 0.754 | PASSED |
For small arrays (10,000 elements), all sorting algorithms perform similarly well, with mergesort showing a slight advantage in average time. VSort remains competitive with custom implementations while still outperforming the standard library sort.
| Algorithm | Avg Time (ms) | Min Time (ms) | Verification |
|---|---|---|---|
| vsort | 38.46 | 36.27 | PASSED |
| quicksort | 45.33 | 45.14 | PASSED |
| mergesort | 39.68 | 39.42 | PASSED |
| std::sort | 60.70 | 60.46 | PASSED |
With larger arrays (1,000,000 elements), VSort shows excellent minimum times (36.27ms), significantly better than all other algorithms including mergesort's best time (39.42ms). This demonstrates that VSort's optimizations become more impactful as data size increases, providing superior performance for large datasets.
VSort performs exceptionally well with large arrays:
```
Large Array Test
----------------
Attempting with 2000000 elements... SUCCESS
Initializing array... DONE
Sorting 2000000 elements... DONE (6.36 ms)
Verifying (sampling)... PASSED
```
#include "vsort.h"
// Sort an array of integers
int array[] = {5, 2, 9, 1, 5, 6};
int size = sizeof(array) / sizeof(array[0]);
vsort(array, size);- Recommended: Apple Silicon Mac (M1/M2/M3/M4) running macOS 11+
- Compatible: Any modern UNIX-like system with a C compiler
- Dependencies: Standard C libraries only (no external dependencies)
```sh
mkdir -p build && cd build
cmake -S .. -B .
cmake --build .
```
CMake automatically detects your hardware and applies appropriate optimizations:
- On Apple Silicon, NEON vector instructions and GCD optimizations are enabled
- OpenMP parallelization is used when available (install GCC or LLVM with OpenMP for best results)
- Standard optimizations are applied on other platforms
```sh
# From the build directory, run all tests
ctest

# Run specific tests
./tests/test_basic          # Basic functionality tests
./tests/test_performance    # Performance benchmark tests
./tests/test_apple_silicon  # Tests specific to Apple Silicon

# Run the standard benchmark with custom parameters
./examples/benchmark --size 1000000 --algorithms "vsort,quicksort,mergesort,std::sort"

# Run the Apple Silicon specific benchmark
./examples/apple_silicon_test
```
The project includes several example programs demonstrating different use cases:
- basic_example.c: Simple demonstration of sorting an integer array
- float_sorting_example.c: Shows how to sort floating-point numbers
- char_sorting_example.c: Demonstrates sorting character arrays
- custom_comparator_example.c: Shows how to use a custom comparator function
- struct_sorting_example.c: Demonstrates sorting structures based on different fields
- performance_benchmark.c: Benchmarks vsort against standard library sorting
- apple_silicon_test.c: Tests optimizations specific to Apple Silicon
The latest version of VSort includes several key optimizations:
- Adaptive algorithm engine: Hybrid introsort with heapsort fallback, insertion sort for small or nearly-sorted ranges, and optional LSD radix sort for large integer data sets.
- Configurable runtime: Feature flags and the `vsort_options_t` struct let callers toggle parallelism, radix sorting, and stable ordering per invocation.
- Thread-safe hardware calibration: Atomic guards guarantee that cache and core detection occurs exactly once, even under concurrent first-use scenarios.
- Parallel pipeline refresh: Apple Silicon builds chunk-sort in parallel and merge via batched dispatch passes, reducing memory churn and improving throughput.
- Specialised fast paths: Stable merge sort activates via `VSORT_FLAG_FORCE_STABLE`, while byte arrays use a dedicated counting sort.
- Cache-aware thresholds: L1/L2 cache data inform insertion, merge, and parallel thresholds for better locality.
- Detailed logging: Consolidated runtime logging provides insight into detected hardware, calibrated thresholds, and fallback decisions.
- Cross-platform resilience: Improved guardrails prevent radix overflow and ensure deterministic ordering for custom comparators.
VSort is based on an optimized hybrid sorting algorithm with the following complexity characteristics:
- Time Complexity:
  - Best Case: O(n log n) - When the array is already nearly sorted
  - Average Case: O(n log n) - Expected performance for random input
  - Worst Case: O(n log n) - Median-of-three pivot selection avoids pathological partitions, and the introsort depth limit falls back to heapsort before quicksort can degrade to O(n²)
- Space Complexity:
  - O(log n) - Iterative implementation uses a stack for managing partitions
  - O(1) additional memory for in-place sorting operations
While the asymptotic complexity matches traditional quicksort, VSort's optimization techniques significantly improve performance constants.
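As a rough back-of-the-envelope model of the parallel path (assuming the array is split into $p$ equal chunks that are sorted independently and then merged), the total work stays within the same bound:

$$
\underbrace{p \cdot O\!\left(\tfrac{n}{p}\log\tfrac{n}{p}\right)}_{\text{chunk sorts}} \;+\; \underbrace{O(n \log p)}_{\text{merge passes}} \;=\; O(n \log n),
$$

so parallelism shortens wall-clock time (ideally by close to the number of usable cores for the chunk-sort phase) without changing the asymptotic complexity.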
VSort automatically optimizes for:
- Hardware detection: Identifies CPU model, cache sizes, and core configuration
- Array size: Different algorithms for small vs. large arrays with auto-calibrated thresholds
- Data patterns: Optimizations for sorted or nearly-sorted data
- Hardware capabilities: Adaptation to available cores and vector units
- Memory constraints: Balance between memory usage and speed
VSort's dynamic threshold adjustment means that the library works optimally without manual configuration, but advanced users can still override settings if needed.
This project is licensed under the MIT License - see the LICENSE file for details.
© 2025 Davide Santangelo
