The optimization loop for AI hardware has effectively closed. DeepReinforce researchers published a new system this week capable of generating low-level GPU code that outperforms NVIDIA’s own hand-tuned libraries by up to 28%.
Dubbed CUDA-L2, the framework leverages Large Language Models (LLMs) guided by Reinforcement Learning (RL) to automate the creation of Half-precision General Matrix Multiply (HGEMM) kernels. These operations are the mathematical bedrock of modern AI training.
Benchmarks released on the GitHub repository show the AI-generated code beating standard libraries across 1,000 distinct configurations on Ampere A100 chips, though support for newer Hopper architectures remains in development.
Benchmarking the Breakthrough
Performance metrics reveal a significant gap between AI-generated kernels and standard libraries. In server scenarios simulating real-time inference, CUDA-L2 achieves a +28.7% speedup over `torch.matmul`. Against NVIDIA’s most optimized baseline, `cuBLASLt-AutoTuning`, the system maintains a +15.9% advantage.
Comparing the results to established industry standards, Songqiao Su, a researcher at DeepReinforce, stated: “CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used torch.matmul to state-of-the-art Nvidia’s closed-source libraries.”
Testing covered 1,000 distinct matrix dimension combinations (M, N, K) rather than a few cherry-picked examples. Offline benchmarks, which run kernels consecutively without pauses, showed slightly lower but still substantial gains of +22.0% over PyTorch.
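The shape sweep described above can be sketched in a few lines. This is an illustrative harness, not DeepReinforce’s actual benchmark code: the shapes, repetition count, and numpy-based matmul are all stand-in assumptions for the real CUDA kernels timed on an A100.

```python
# Sketch of an offline-style benchmark sweep over (M, N, K) matmul shapes.
# Shapes and the timing loop are illustrative, not the project's harness.
import time
import numpy as np

def time_matmul(m, n, k, reps=5):
    """Run a half-precision matmul back-to-back and return mean seconds."""
    a = np.random.rand(m, k).astype(np.float16)
    b = np.random.rand(k, n).astype(np.float16)
    start = time.perf_counter()
    for _ in range(reps):
        a @ b  # consecutive launches with no pauses: "offline" mode
    return (time.perf_counter() - start) / reps

# A small illustrative grid; the paper sweeps 1,000 distinct combinations.
shapes = [(m, n, k) for m in (64, 128) for n in (64, 128) for k in (64, 128)]
results = {s: time_matmul(*s) for s in shapes}
print(f"{len(results)} configurations timed")
```

Sweeping the full (M, N, K) grid rather than a handful of shapes is what makes the claimed averages meaningful: a kernel tuned for one aspect ratio can easily regress on another.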
Such consistent results challenge the assumption that hand-tuned vendor libraries are the performance ceiling. By automating the discovery of optimal configurations, the system exposes inefficiencies in general-purpose libraries that must balance performance across a wide range of use cases.
The Methodology: LLMs Writing Assembly
DeepReinforce’s approach fundamentally shifts kernel optimization from heuristic design to automated exploration: an LLM generates candidate kernels, which are then refined via reinforcement learning.
Describing the scale of the automated search process, Su noted: “Even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans.”
To prevent “lazy” evaluation errors common in Python-based testing, the system validates correctness against FP32 CPU references. This validation step ensures that speed gains are not achieved by sacrificing numerical precision or stability.
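A correctness check of this kind can be sketched as follows. The tolerances and the numpy-based stand-in for a generated kernel are illustrative assumptions; the actual project compares real CUDA output against FP32 CPU references.

```python
# Sketch of validating a low-precision kernel's output against an FP32
# CPU reference, in the spirit of the check described above.
# The tolerances here are illustrative, not the project's actual values.
import numpy as np

def validate_hgemm(a_fp16, b_fp16, candidate_out, rtol=1e-2, atol=1e-2):
    """Compare a candidate FP16 matmul result to an FP32 reference."""
    ref = a_fp16.astype(np.float32) @ b_fp16.astype(np.float32)
    return np.allclose(candidate_out.astype(np.float32), ref, rtol=rtol, atol=atol)

a = np.random.rand(32, 16).astype(np.float16)
b = np.random.rand(16, 8).astype(np.float16)
out = a @ b  # stand-in for a generated kernel's FP16 output
print(validate_hgemm(a, b, out))
```

Comparing against a higher-precision reference, rather than another GPU kernel, rules out the failure mode where a “fast” candidate is fast because it silently computes the wrong thing.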
Navigating this complexity, the RL agent explores configuration spaces that are too vast for human engineers to manually tune effectively. By iterating on thousands of potential kernel designs, the system identifies non-obvious optimizations that standard heuristics miss.
According to the research paper, the performance gains are particularly pronounced in realistic deployment scenarios:
“In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration… and +11.4% over the most competitive cuBLASLt-AutoTuning model.”
“In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% for torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning respectively.”
These figures highlight the practical impact of the optimization strategy. While offline gains demonstrate raw throughput improvements, the larger speedups in server mode suggest that the generated kernels handle intermittent workloads more efficiently than their hand-tuned counterparts.
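The distinction between the two benchmark modes can be made concrete with a small sketch. The random-gap model of request arrivals is an illustrative assumption, not the paper’s measurement protocol:

```python
# Sketch contrasting the two modes described above: offline
# (back-to-back launches) vs. server (random gaps between launches).
import random
import time
import numpy as np

def bench(mode, reps=10, max_gap_s=0.001):
    """Time a matmul in 'offline' or 'server' mode; gaps are excluded."""
    a = np.random.rand(128, 128).astype(np.float16)
    b = np.random.rand(128, 128).astype(np.float16)
    total = 0.0
    for _ in range(reps):
        if mode == "server":
            time.sleep(random.uniform(0, max_gap_s))  # simulate request arrivals
        start = time.perf_counter()
        a @ b
        total += time.perf_counter() - start
    return total / reps

offline_t = bench("offline")
server_t = bench("server")
print(offline_t, server_t)
```

On real hardware the gaps matter because caches cool down and clocks drop between launches, which is why kernels can rank differently in server mode than in back-to-back throughput tests.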
Hardware Constraints & Market Impact
Current optimization is strictly limited to the NVIDIA A100 (Ampere) architecture. Support for newer architectures like Hopper (H100) and Blackwell is planned but not yet available. Consequently, immediate benefits are restricted to legacy or existing enterprise clusters rather than cutting-edge deployments.
The GitHub documentation explicitly outlines these hardware limitations:
“Ideally, kernels trained on A100 should only be used on A100 if you are targeting speedup. They might have speedup on other machines, but it’s not guaranteed. We will progressively release kernels trained on different machines.”
Reducing the computational cost of HGEMM operations directly impacts the bottom line for large-scale model training. As matrix multiplication consumes the majority of GPU cycles in LLM workloads, even marginal efficiency gains translate to millions in savings.
Analyzing the economic implications, Rohan Paul, an AI Analyst at Rohan’s Bytes, observed: “For LLM pretraining and fine tuning, most of the GPU time is spent doing these HGEMM matrix multiplies, so if those kernels run about 10% to 30% faster, the whole training or tuning job can get noticeably cheaper.”
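Paul’s point follows from Amdahl’s law. The figures below (an 80% HGEMM share of GPU time, a 20% kernel speedup) are hypothetical inputs chosen for illustration, not measurements from the paper:

```python
# Back-of-the-envelope illustration of the claim above via Amdahl's law.
# The 80% HGEMM share and 20% kernel speedup are hypothetical inputs.
def job_speedup(hgemm_fraction, kernel_speedup):
    """Overall job speedup when only the HGEMM fraction gets faster."""
    return 1.0 / ((1.0 - hgemm_fraction) + hgemm_fraction / (1.0 + kernel_speedup))

s = job_speedup(0.80, 0.20)
print(f"whole-job speedup: {s:.3f}x")  # → roughly 1.154x, i.e. ~13% cheaper
```

The takeaway is that because matmul dominates the cycle budget, most of a kernel-level speedup survives into the whole-job cost, rather than being diluted by other work.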
Releasing the code on GitHub allows researchers to verify these claims independently, though third-party validation is still pending. If the methodology proves transferable to newer architectures, it could force a rethinking of how low-level GPU libraries are developed and maintained.