Earlier this year, I took a month to reexamine my coding habits and rethink some past design choices. I hope to rewrite and improve my FOSS libraries this year, and I needed answers to a few questions first. Perhaps some of these questions will resonate with others in the community, too.
- Are coroutines viable for high-performance work?
- Should I use SIMD intrinsics for clarity or drop to assembly for easier library distribution?
- Has hardware caught up with vectorized scatter/gather in AVX-512 & SVE?
- How do secure enclaves & pointer tagging differ on Intel, Arm, & AMD?
- What's the throughput gap between CPU and GPU Tensor Cores (TCs)?
- How costly are misaligned memory accesses & split-loads, and what gains do non-temporal loads/stores offer?
- Which parts of the standard library hit performance hardest?
- How do error-handling strategies compare overhead-wise?
- What's the compile-time vs. run-time trade-off for lazily evaluated ranges?
- What practical, non-trivial use cases exist for meta-programming?
- How challenging is Linux Kernel bypass with io_uring vs. POSIX sockets?
- How close are we to effectively using Networking TS or heterogeneous Executors in C++?
- What are best practices for propagating stateful allocators in nested containers, and which libraries support them?
These questions span from micro-kernel optimizations (nanoseconds) to distributed systems (micro- and millisecond latencies). Rather than tackling them all in one post, I compiled my explorations into a repository—extending my previous Google Benchmark tutorial (<https://ashvardanian.com/posts/google-benchmark>)—to serve as a sandbox for performance experimentation.

Some fun observations:
- Compilers now auto-vectorize 3x3x3 and 4x4x4 single- and double-precision matrix multiplications well! The smaller kernel is only ~60% slower despite needing ~70% fewer operations, and the compiler-generated code outperforms my vanilla SSE/AVX kernels, coming within 10% of AVX-512 (a scalar version is sketched after this list).
- Nvidia TCs vary dramatically across generations in numeric types, throughput, tile shapes, thread synchronization (thread/quad-pair/warp/warp-groups), and operand storage. Post-Volta, manual PTX is often needed (as intrinsics lag), though the new TileIR (introduced at GTC) promises improvements for dense linear algebra kernels. The warp-level WMMA sketch below shows the baseline programming model.
- The AI wave drives CPUs and GPUs to converge in mat-mul throughput & programming complexity. It took me a day to debug AMX tile (TMM) register initialization (sketched below), and Arm's SME is equally odd. Sierra Forest packs 288 cores per socket, and AVX10.2 drops the 256-bit-only option in favor of mandatory 512-bit... I wonder if discrete Intel GPUs are even needed, given CPU advances?
- In common floating-point ranges, scalar sine approximations can be up to 40x faster than standard library implementations, even without SIMD. It's a bit hand-wavy, though; I wish more projects documented error bounds and shipped 1 and 3.5 ULP variants like Sleef does (a toy polynomial version follows this list).
- Meta-programming tools like CTRE can outperform typical regex engines by 5x and simplify building parsers compared to hand-crafted FSMs (see the version-parsing example below).
- Kernel-bypass networking and io_uring were once clearly distinct in complexity and performance (DPDK/SPDK vs. io_uring), but the gap is narrowing. Even pre-5.5 io_uring can boost UDP throughput by 4x on loopback IO, while the newer zero-copy and concurrency optimizations remain challenging to use (a minimal liburing round-trip is sketched below).
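To make the auto-vectorization point concrete, here's a minimal scalar 4x4 kernel (an illustration, not the repo's exact benchmark code). With `-O3`, and AVX-512 enabled where available, modern GCC and Clang fully unroll and vectorize this loop nest:

```cpp
#include <array>

using mat4 = std::array<std::array<float, 4>, 4>;

mat4 matmul4x4(mat4 const &a, mat4 const &b) {
    mat4 c{};
    for (int i = 0; i != 4; ++i)
        for (int k = 0; k != 4; ++k)     // k before j keeps `b` accesses contiguous
            for (int j = 0; j != 4; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}
```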
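For the Tensor Core observation, the warp-level WMMA API is the most portable entry point. This is a hedged CUDA C++ sketch of a single 16x16x16 half-precision tile (one warp, sm_70+), not a production kernel:

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void tile_16x16x16(half const *a, half const *b, float *c) {
    // Fragments are opaque, warp-distributed register storage for operands:
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16); // all 32 threads cooperate
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // D = A*B + C on TCs
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
// Launch with a single warp: tile_16x16x16<<<1, 32>>>(a, b, c);
```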
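The AMX debugging pain mentioned above mostly comes from tile configuration: on Linux, you must ask the kernel for permission to use TMM state before executing any tile instruction, or the process dies with SIGILL. A hedged sketch, with constants taken from the kernel ABI (compile with `-mamx-tile`):

```cpp
#include <cstdint>
#include <immintrin.h>
#include <sys/syscall.h>
#include <unistd.h>

constexpr long ARCH_REQ_XCOMP_PERM = 0x1023; // arch_prctl request code
constexpr long XFEATURE_XTILEDATA = 18;      // AMX tile data state bit

// LDTILECFG expects a 64-byte, 64-byte-aligned configuration blob:
struct alignas(64) tile_config_t {
    std::uint8_t palette_id;
    std::uint8_t start_row;
    std::uint8_t reserved[14];
    std::uint16_t colsb[16]; // bytes per row of each tile
    std::uint8_t rows[16];   // rows of each tile
};

bool enable_amx_and_configure() {
    // Without this syscall, the first AMX instruction raises SIGILL:
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA))
        return false;
    tile_config_t cfg{};
    cfg.palette_id = 1; // the only palette defined so far: 8 tiles of 16x64 bytes
    for (int i = 0; i != 8; ++i) cfg.rows[i] = 16, cfg.colsb[i] = 64;
    _tile_loadconfig(&cfg); // loads TMM shapes and zeroes tile contents
    return true;
}
// After this, _tile_loadd / _tile_dpbssd / _tile_stored operate on TMM0-7,
// and _tile_release() frees the state when done.
```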
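On the sine approximations, the idea is easiest to see with a toy truncated-Taylor polynomial in Horner form. This is an illustration, not one of the tuned kernels; on [-π/2, π/2] the error is bounded by x⁹/9! ≈ 1.6e-4, nowhere near Sleef's 1-ULP guarantees:

```cpp
#include <cmath>
#include <cstdio>

// sin(x) ≈ x - x³/3! + x⁵/5! - x⁷/7!, usable on [-π/2, π/2]:
inline float sin_poly(float x) {
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6 + x2 * (1.0f / 120 - x2 / 5040)));
}

int main() {
    for (float x = -1.5f; x <= 1.5f; x += 0.5f)
        std::printf("%+.1f: %+.6f vs %+.6f\n", x, sin_poly(x), std::sin(x));
}
```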
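And the CTRE point, since "compile-time regex" sounds too good to be true: assuming C++20 and the single-header `ctre.hpp`, the pattern below is compiled into a state machine at build time, so there is no runtime regex engine at all (the pattern and helper are made up for illustration):

```cpp
#include <ctre.hpp>
#include <cstdio>
#include <string_view>
#include <utility>

// Parse "major.minor" strings; the FSM is generated at compile time:
constexpr std::pair<int, int> parse_version(std::string_view s) {
    if (auto m = ctre::match<"([0-9]+)\\.([0-9]+)">(s))
        return {m.get<1>().to_number(), m.get<2>().to_number()};
    return {-1, -1};
}

int main() {
    auto [major, minor] = parse_version("4.2");
    std::printf("major=%d, minor=%d\n", major, minor);
}
```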
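Finally, for scale: the liburing hello-world below (a hedged sketch, error handling omitted) is all it takes to do a full submit/complete round-trip, which is exactly why the DPDK-vs-io_uring complexity gap keeps narrowing:

```cpp
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0); // SQ/CQ rings with 8 entries

    int fd = open("/etc/hostname", O_RDONLY);
    char buf[256];

    // Queue a read on the submission queue...
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring); // one syscall submits any number of SQEs

    // ...and reap the completion:
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    std::printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
}
```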
The repository is loaded with links to favorite CppCon lectures, GitHub snippets, and tech blog posts. Recognizing that many high-level concepts are handled differently across languages, I've also started porting examples to Rust & Python in separate repos. Coroutines look bad everywhere :(

Overall, this research project was rewarding! Most questions found answers in code — except pointer tagging and secure enclaves, which still elude me in the public cloud. I'd love to hear from others, especially on comparing High-Level Synthesis for small matrix multiplications on FPGAs versus hand-written VHDL/Verilog for integral types. Let me know if you have ideas for other cool, obscure topics to cover!