Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch

TL;DR

NanoEuler is a research project that builds a GPT-2 scale language model from scratch in C and CUDA, without any ML libraries. It includes complete training and inference pipelines with verified gradients, and it runs on a single consumer GPU.

A developer has introduced NanoEuler, a GPT-2 scale language model built from scratch in C and CUDA with no reliance on external ML libraries. The project includes a complete training pipeline and verified backpropagation, runs on a single consumer GPU, and is intended as an educational example of low-level neural network engineering.

The model is a decoder-only transformer with modern components: RMSNorm, rotary position embeddings (RoPE), SwiGLU feed-forward layers, and grouped-query attention. It has approximately 116 million parameters and is trained on a mixture of books and web data, using a hand-written CUDA engine whose matrix multiplication, FlashAttention, and gradient kernels are validated against CPU reference implementations.

The training pipeline includes a byte-level BPE tokenizer, pretraining, and supervised fine-tuning toward a chat model, though the result remains a research artifact rather than a practical chatbot. The entire process is built in public, with an emphasis on transparency and educational value.
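To make the byte-level BPE idea concrete, the sketch below shows the core operation in C: text starts as one token id per byte, and each learned merge replaces an adjacent pair of ids with a new id. It covers only how a single merge is applied, not how merges are chosen during tokenizer training, and the function names and the new-id value are hypothetical rather than taken from the project.

```c
#include <assert.h>
#include <stddef.h>

/* Turn raw bytes into initial token ids, one id per byte (0..255). */
void bpe_bytes_to_ids(const unsigned char *bytes, size_t n, int *ids)
{
    assert(bytes != NULL && ids != NULL);
    for (size_t i = 0; i < n; i++)
        ids[i] = (int)bytes[i];
}

/* Replace every non-overlapping occurrence of the adjacent pair (a, b)
   in ids[0..n) with new_id, scanning left to right.
   Works in place and returns the new sequence length. */
size_t bpe_merge_pair(int *ids, size_t n, int a, int b, int new_id)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        if (i + 1 < n && ids[i] == a && ids[i + 1] == b) {
            ids[out++] = new_id;   /* pair collapses to one token */
            i += 2;
        } else {
            ids[out++] = ids[i];   /* token is copied unchanged */
            i += 1;
        }
    }
    return out;
}
```

Applied to the bytes of "aaab" with a merge of (97, 97) into id 256, the sequence 97 97 97 98 becomes 256 97 98. Encoding a full text would repeat this step once per learned merge, in the order the merges were learned.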

At a glance

Announcement
When: ongoing, with public release and demons…

The development: A developer has released NanoEuler, a GPT-2 scale language model implemented entirely in C/CUDA, with verified backpropagation and training pipeline, trained on a single GPU.

Implications for Low-Level Neural Network Engineering

By building a GPT-2 scale model entirely in C and CUDA without external dependencies, NanoEuler demonstrates the feasibility of low-level neural network implementation and training. This approach offers insights into the inner workings of large language models and provides an educational resource for understanding model architecture, gradient verification, and custom kernel development. It also highlights the potential for highly optimized, self-contained AI systems on consumer hardware, though it remains a research prototype with limited capabilities.

Background on From-Scratch Neural Network Implementations

Traditional large language models rely heavily on ML frameworks like PyTorch or TensorFlow, which abstract away hardware-specific optimizations and gradient calculations. Building models from scratch in C/CUDA is rare and primarily educational, often used for research into low-level optimizations or understanding core mechanics. Prior efforts have focused on small-scale models or partial implementations, but NanoEuler advances this by creating a full training pipeline for a GPT-2 scale model, verified through rigorous gradient checks.

The project is inspired by the residual Euler method for differential equations, with the name ‘Euler’ referencing the numerical integration technique. It aims to own every piece of the training process, from tokenization to kernel execution, providing transparency and control over the entire pipeline.
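One common way to read the Euler reference, offered here as an interpretation rather than something the project states in this form, is that a residual connection has the same shape as one forward Euler step. The symbols below (x, h, f, F) are introduced for illustration only.

```latex
% Forward Euler step for dx/dt = f(x, t) with step size h:
x_{n+1} = x_n + h \, f(x_n, t_n)

% Residual update in transformer layer \ell, where F_\ell is the
% attention or feed-forward sublayer:
x_{\ell+1} = x_\ell + F_\ell(x_\ell)
```

Setting h = 1 and treating the layer index as the time variable makes the two updates coincide in form.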

“This is a research and educational artifact, built in public, demonstrating how to implement a language model entirely from scratch in C/CUDA.”
— Project creator

Limitations and Unconfirmed Capabilities

While NanoEuler successfully implements a GPT-2 style model with verified gradients and training pipeline, its language generation remains shallow, producing fluent but largely nonsensical text. The model’s knowledge is limited due to small size and data scope, and its practical utility as a chatbot or assistant is not demonstrated. The scalability to larger models and more complex tasks remains untested, and the project’s long-term stability or performance on different hardware is still uncertain.

Future Developments and Research Directions

Next steps include expanding training data, increasing model size, and fine-tuning for specific tasks such as conversational AI. The developer plans to implement RLHF/DPO fine-tuning, improve tokenization, and optimize CUDA kernels further. Additionally, the project aims to serve as an educational resource for low-level neural network implementation, encouraging others to explore from-scratch model building.

Key Questions

Can NanoEuler be used for practical applications?

Currently, NanoEuler is a research and educational project, not optimized for practical use. Its models produce fluent but shallow text, without the real-world knowledge or robustness needed for deployment.

What are the main technical achievements of NanoEuler?

The project includes a fully handwritten training pipeline, verified backpropagation, and a CUDA engine with custom kernels, all built without external ML libraries, demonstrating low-level neural network implementation.

Will NanoEuler scale to larger models?

The current focus is on small to medium-sized models (~116M parameters). Scaling to larger models would require more data, compute, and further engineering, which are future goals.

Is the code open source?

Yes, the project is publicly available, emphasizing transparency, educational value, and community engagement.

How does NanoEuler verify the correctness of its gradients?

It performs a gradient check by comparing analytic gradients against finite difference approximations in double precision, achieving errors below 1e-4.

Source: Hacker News

Author

Deep Intellica Team
