Neural Graphics Now 1000x Faster with Smart Encoding

In 2022, a team of four researchers at NVIDIA published a paper with an abstract so bold it almost sounded like a joke. They claimed they could train a neural network to represent a 3D scene from scratch, from a handful of photographs, in a matter of seconds. Not hours. Not minutes. Seconds. And then render that scene in real time at full HD resolution.

The paper was titled "Instant neural graphics primitives with a multiresolution hash encoding." The lead author was Thomas Müller. His coauthors were Alex Evans, Christoph Schied, and Alexander Keller. The paper was published in ACM Transactions on Graphics and has since accumulated over 3,600 citations.

The claim was not a joke. It was a genuine breakthrough that changed how fast neural graphics could work. The key was not a bigger neural network or more GPU memory. It was a smarter way to encode the input data, using a trick that sounds almost too simple: a hash table.

The Problem That Nobody Solved Fast Enough

Neural graphics primitives are a class of algorithms that use small neural networks to represent complex visual data. You give the network a coordinate, say a point in 3D space, and it returns a color or a density value. Do this for millions of points, and you can reconstruct a full 3D scene from 2D images.

The most famous example is NeRF, or Neural Radiance Fields, introduced in 2020. NeRF could produce stunningly realistic novel views of a scene from a sparse set of photographs. But it had a fatal flaw: training took a day or more on a single GPU, and rendering a single frame took many seconds, far from real time. This was not a tool for artists or engineers. It was a research curiosity.

The bottleneck was the neural network itself. To represent fine details like hair, grass, or text, the network needed to be large. Large networks meant more floating point operations, more memory accesses, and slower training. The standard approach was to use a positional encoding, which maps each input coordinate to sines and cosines at a range of frequencies and so lifts it into a higher dimensional space. This helped the network learn high frequency details, but it still required a big network to handle complex scenes.
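For readers who have not met it, the positional encoding can be sketched in a few lines. This is a minimal illustration, assuming the common form with frequencies that double at each step; the function names and the default of 10 frequencies are choices made for this example.

```python
import math

def positional_encoding(x, num_freqs=10):
    """Map one scalar coordinate to sines and cosines at doubling frequencies."""
    features = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        features.append(math.sin(freq * x))
        features.append(math.cos(freq * x))
    return features

def encode_point(point, num_freqs=10):
    """Encode every coordinate of a point and concatenate the results."""
    encoded = []
    for coord in point:
        encoded.extend(positional_encoding(coord, num_freqs))
    return encoded

features = encode_point((0.25, 0.5, 0.75))
print(len(features))  # 3 coordinates * 10 frequencies * 2 functions = 60
```

Note that nothing here is learned. The transformation is fixed, so all of the scene-specific knowledge has to live in the network weights.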

Müller and his team asked a different question. What if the network could be small and dumb, and the encoding did all the heavy lifting?

The Hash Table That Learned on the Job

The core idea in the paper is deceptively simple. Instead of feeding raw coordinates into a neural network, the authors first map each coordinate to a set of feature vectors stored in a multiresolution hash table. The network then reads these feature vectors and uses them to produce its output.

Here is how it works. Imagine you have a 3D scene. You divide it into multiple grids at different resolutions, like a pyramid of increasingly fine grids. At the coarsest level, each grid cell is large. At the finest level, each cell is tiny. For each grid cell at each resolution, you store a feature vector, a small array of numbers that the network learns during training.
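The pyramid of grids can be made concrete. In the paper, the resolutions grow geometrically between a coarsest and a finest value. The sketch below follows that rule; the specific defaults (16, 2048, and 16 levels) are illustrative assumptions rather than settings for any particular scene.

```python
import math

def level_resolutions(n_min=16, n_max=2048, num_levels=16):
    """Grid resolution per level, growing geometrically from n_min to n_max."""
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (num_levels - 1))
    return [math.floor(n_min * growth ** level) for level in range(num_levels)]

resolutions = level_resolutions()
for level, resolution in enumerate(resolutions):
    print(level, resolution)
```

Because the growth is geometric, each level is finer than the last by a constant factor, so a modest number of levels spans everything from coarse scene layout to fine detail.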

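When you want to query a point in the scene, you look up which grid cell it falls into at each resolution. In the paper's formulation the feature vectors sit at the corners of the cells, so you fetch the vectors at the surrounding corners and blend them by linear interpolation according to where the point lies inside the cell. You then concatenate the interpolated vectors from all resolutions into a single long vector and feed it to the neural network.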
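The query step can be sketched as follows. To keep it short, this example works in 2D, uses plain dense grids instead of a hash table, and stores features at grid corners that are blended by linear interpolation, as the paper does. All names, the three resolutions, and the tiny random initial values are assumptions for the example.

```python
import math
import random

def make_level(resolution, num_features=2, seed=0):
    """One dense grid level: a feature vector at each of (resolution+1)^2 corners."""
    rng = random.Random(seed)
    return {
        (ix, iy): [rng.uniform(-1e-4, 1e-4) for _ in range(num_features)]
        for ix in range(resolution + 1)
        for iy in range(resolution + 1)
    }

def interpolate_level(grid, resolution, x, y):
    """Bilinearly blend the four corner features around (x, y) in [0, 1]^2."""
    gx, gy = x * resolution, y * resolution
    x0 = min(int(math.floor(gx)), resolution - 1)
    y0 = min(int(math.floor(gy)), resolution - 1)
    tx, ty = gx - x0, gy - y0
    num_features = len(grid[(0, 0)])
    out = [0.0] * num_features
    for dx, wx in ((0, 1.0 - tx), (1, tx)):
        for dy, wy in ((0, 1.0 - ty), (1, ty)):
            corner = grid[(x0 + dx, y0 + dy)]
            for f in range(num_features):
                out[f] += wx * wy * corner[f]
    return out

def encode(levels, x, y):
    """Concatenate the interpolated features from every level."""
    encoded = []
    for resolution, grid in levels:
        encoded.extend(interpolate_level(grid, resolution, x, y))
    return encoded

levels = [(res, make_level(res, seed=res)) for res in (4, 8, 16)]
features = encode(levels, 0.3, 0.7)
print(len(features))  # 3 levels * 2 features = 6
```

The interpolation is what makes the encoding vary smoothly as the query point moves, instead of jumping whenever it crosses a cell boundary.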

The trick is that the feature vectors are stored in a hash table, not a dense array. A dense array would require an enormous amount of memory. A high resolution 3D grid with billions of cells needs tens of gigabytes for that one level, even at only a few bytes per cell, and the cost grows cubically as the grid gets finer. A hash table uses a fixed amount of memory, say a few megabytes per level, and a hash function maps each grid cell to a pseudo-random index in the table. Multiple grid cells can map to the same index. This is called a hash collision.
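A sketch of the lookup index and the memory arithmetic may help here. The hash below follows the form given in the paper: an XOR of the integer grid coordinates, each multiplied by a large constant, reduced modulo the table size. The byte counts assume two half-precision features per entry, which is an illustrative choice, and the function names are made up for this example.

```python
# Per-dimension multipliers given in the paper for the spatial hash.
PRIMES = (1, 2654435761, 805459861)

def spatial_hash(ix, iy, iz, table_size):
    """Map non-negative integer grid coordinates to a slot in the table.

    Python integers do not wrap at 32 bits, but for power-of-two table sizes
    up to 2**32 the low bits, and therefore the result, match 32-bit arithmetic.
    """
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size

def dense_grid_bytes(resolution, num_features=2, bytes_per_value=2):
    """Memory for one dense 3D level with a feature vector at every corner."""
    return (resolution + 1) ** 3 * num_features * bytes_per_value

def hashed_level_bytes(table_size, num_features=2, bytes_per_value=2):
    """Memory for one hashed level; independent of the grid resolution."""
    return table_size * num_features * bytes_per_value

print(spatial_hash(123, 456, 789, 2 ** 19))
print(dense_grid_bytes(2048) / 1e9, "GB for a dense 2048^3 level")
print(hashed_level_bytes(2 ** 19) / 2 ** 20, "MiB for a hashed level")
```

Under these assumptions the dense level costs tens of gigabytes while the hashed level costs 2 MiB, whatever the grid resolution.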

Collisions sound like a bug. They are actually a feature. At low resolutions there are no collisions at all, because the coarse grids have fewer cells than the table has slots. At high resolutions, collisions become frequent. But the multiresolution structure means that even if two points collide at the finest level, they will almost always fall in different cells, and so have different feature vectors, at coarser levels. The network learns to use the coarser features to disambiguate the collisions. The hash table does not need to be perfect. It just needs to be good enough.
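The collision argument can be checked with a little arithmetic. The sketch below uses the hash form from the paper, compares the number of grid corners to the number of table slots at several resolutions, and then shows two fine-level corners that share a slot yet sit in different coarse cells. The table sizes and the chosen pair of corners are contrived for the example.

```python
# Per-dimension multipliers given in the paper for the spatial hash.
PRIMES = (1, 2654435761, 805459861)

def spatial_hash(vertex, table_size):
    """XOR the coordinates, each scaled by its multiplier, then wrap to the table."""
    h = 0
    for index, prime in zip(vertex, PRIMES):
        h ^= index * prime
    return h % table_size

def load_factor(resolution, table_size, dims=3):
    """Grid corners per table slot; above 1.0, collisions are unavoidable."""
    return (resolution + 1) ** dims / table_size

def coarse_cell(vertex, fine_resolution, coarse_resolution):
    """Cell of the coarse grid that contains a corner of the fine grid."""
    return tuple(
        min(index * coarse_resolution // fine_resolution, coarse_resolution - 1)
        for index in vertex
    )

TABLE_SIZE = 2 ** 19
for resolution in (16, 64, 256, 1024):
    print(resolution, round(load_factor(resolution, TABLE_SIZE), 3))

# Two corners of a 256^3 grid that collide in a deliberately tiny table.
a, b = (0, 0, 0), (64, 0, 0)
small_table = 64
print(spatial_hash(a, small_table) == spatial_hash(b, small_table))  # True
print(coarse_cell(a, 256, 16), coarse_cell(b, 256, 16))  # (0, 0, 0) (4, 0, 0)
```

With a table of 2^19 slots, the grids at resolutions 16 and 64 fit with room to spare, while the finer ones pack dozens to thousands of corners into each slot. The colliding pair is still told apart at the coarse level, which is the disambiguation the text describes.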

The authors demonstrated this approach on four tasks: fitting a gigapixel image, learning a signed distance function for a 3D shape, neural radiance caching for real-time rendering, and training a neural radiance field for view synthesis. In every case, the hash encoding dramatically reduced training time.

The Numbers Are Hard to Believe

Müller and his team reported that their method could train a high quality neural graphics primitive in seconds, a combined speedup of several orders of magnitude over prior methods. They trained a NeRF equivalent on a scene with 100 images in about 5 seconds on a single NVIDIA RTX 3090 GPU. Rendering a frame at 1920x1080 resolution took tens of milliseconds.

To put that in perspective, the original NeRF paper required 1 to 2 days of training on a single GPU. The fastest methods available at the time still needed 10 to 20 minutes. Set against a few seconds, that makes the hash encoding approach well over 1000x faster than the original NeRF and more than 100x faster than the best prior methods.
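The ratios are easy to check. This snippet simply divides the training times quoted in this article; it takes those figures at face value and is not a benchmark.

```python
SECONDS_PER_DAY = 24 * 60 * 60

hash_encoding_seconds = 5
original_nerf_seconds = (1 * SECONDS_PER_DAY, 2 * SECONDS_PER_DAY)
fast_prior_seconds = (10 * 60, 20 * 60)

def speedup_range(baseline_range, new_seconds):
    """Speedup factor at the low and high end of a baseline time range."""
    return tuple(baseline / new_seconds for baseline in baseline_range)

print(speedup_range(original_nerf_seconds, hash_encoding_seconds))  # (17280.0, 34560.0)
print(speedup_range(fast_prior_seconds, hash_encoding_seconds))     # (120.0, 240.0)
```

On these numbers the headline figure of 1000x is, if anything, conservative.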

The quality was not sacrificed. The authors showed that their method produced images with comparable or better visual fidelity than state of the art techniques. The neural network itself was tiny: just two hidden layers with 64 neurons each. The hash table was the real workhorse.

How They Made It So Fast

The speed comes from three design choices, each carefully optimized.

First, the network is small. A NeRF-style network typically has 8 or more layers with 256 or more neurons per layer. The hash encoding network has 2 hidden layers with 64 neurons each. This means far fewer floating point operations per query. By a rough count of multiply-adds, the small network is well over an order of magnitude cheaper to evaluate.
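A back-of-the-envelope count shows the size of the gap. The sketch below counts one multiply-add per weight in a plain fully connected network. The input and output widths are assumptions for illustration (60 inputs for a frequency-encoded 3D point, 32 inputs for 16 levels of 2 features, 4 outputs), and it ignores skip connections and any view-direction branch.

```python
def mlp_multiply_adds(input_dim, hidden_width, hidden_layers, output_dim):
    """Multiply-adds for one forward pass of a plain fully connected MLP."""
    dims = [input_dim] + [hidden_width] * hidden_layers + [output_dim]
    return sum(a * b for a, b in zip(dims, dims[1:]))

big = mlp_multiply_adds(input_dim=60, hidden_width=256, hidden_layers=8, output_dim=4)
small = mlp_multiply_adds(input_dim=32, hidden_width=64, hidden_layers=2, output_dim=4)

print(big)    # 475136
print(small)  # 6400
print(round(big / small, 1))  # 74.2
```

Under these assumptions the small network needs roughly 74 times fewer multiply-adds per query. That is a large saving, but it is only part of the overall speedup; the rest comes from the encoding and the implementation.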

Second, the hash table is parallelizable. Modern GPUs are designed for massive parallelism, and the hash table lookups are independent for each query point. The authors implemented the system with fully fused CUDA kernels, in which the whole small network is evaluated for a batch of queries inside a single kernel rather than with one kernel launch per layer. This minimizes launch overhead and keeps intermediate results in fast on-chip memory instead of moving them through slower GPU memory between layers.

Third, the hash table is trainable. The feature vectors in the hash table are optimized through stochastic gradient descent, just like the weights of the neural network. This means the encoding adapts to the specific scene during training. The network does not need to learn to interpret the encoding. The encoding learns to represent the scene efficiently.

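Gradients also have to flow through the hash table lookups. This sounds like a problem, because a lookup involves discrete indexing. But once the indices are fixed, the encoded output is a linear function of the stored feature vectors, weighted by the interpolation coefficients. The lookup can therefore be treated as a multiplication by a constant matrix, which makes gradient computation straightforward: each table entry receives the upstream gradient scaled by its interpolation weight, summed over every query that touched it. Entries that no query touched receive no update.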
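A one-dimensional toy makes the gradient flow concrete. In this sketch the table holds scalar features, a lookup blends two neighbouring entries linearly, and the backward pass adds each query's error, scaled by the interpolation weight, onto the entries it touched. The modulo indexing is a stand-in for a real hash, and the loss, learning rate, and names are assumptions for the example.

```python
import math

def lookup_1d(table, resolution, x):
    """Return the interpolated value and the (index, weight) pairs it used."""
    gx = x * resolution
    i0 = min(int(math.floor(gx)), resolution - 1)
    t = gx - i0
    # Modulo indexing is a placeholder for a hash function.
    touched = [(i0 % len(table), 1.0 - t), ((i0 + 1) % len(table), t)]
    value = sum(weight * table[index] for index, weight in touched)
    return value, touched

def sgd_step(table, resolution, samples, lr=0.1):
    """One gradient step on a squared error loss; returns the loss before the step."""
    grads = [0.0] * len(table)
    loss = 0.0
    for x, target in samples:
        value, touched = lookup_1d(table, resolution, x)
        error = value - target
        loss += 0.5 * error * error
        for index, weight in touched:
            grads[index] += error * weight  # scatter-add onto touched entries
    for index, grad in enumerate(grads):
        table[index] -= lr * grad
    return loss

table = [0.0] * 8
samples = [(0.5, 1.0)]
for step in range(3):
    print(step, sgd_step(table, 4, samples))
print(table)
```

Only the entries next to the sample move. Everything else in the table stays at its initial value, which is why each training step updates just a small part of the encoding.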

What This Actually Changes

Before this paper, neural graphics primitives were a research curiosity. They were too slow for practical use. After this paper, they became a tool that could be deployed in real applications.

The most immediate impact was in 3D reconstruction and view synthesis. With the hash encoding, you can take a smartphone video of an object, train a neural network on its frames for a few seconds, and get a 3D representation that you can view from any angle. This is not a theoretical possibility. NVIDIA demonstrated it publicly as Instant NeRF in 2022 and released the code as open source.

But the impact goes beyond NeRF. The hash encoding is a general purpose input encoding for any neural network that operates on coordinates. It has been applied to signed distance functions, which are used in robotics and computer aided design. It has been applied to neural radiance fields for dynamic scenes. It has been applied to audio generation. The same idea works wherever you need to represent a continuous function over a spatial or temporal domain.

What the Paper Does Not Prove

The hash encoding is not a universal solution. It has limitations that the authors acknowledged and that subsequent research has explored.

First, the hash table has a fixed size, so the amount of detail it can hold is bounded. For very large scenes, like an entire city, more and more fine grid cells compete for the same entries, and the table would need to be enormous to capture fine details. The authors experimented with table sizes from 2^14 to 2^24 entries per resolution level. Scaling to city scale would require a different approach.
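The fixed budget can be put in numbers. In the paper's parameterisation, each level stores at most T feature vectors of F values, so L levels hold at most L x T x F trainable values regardless of how fine the grids are. The sketch below counts them for a few finest resolutions; the defaults (16 levels, T = 2^19, F = 2) are illustrative assumptions.

```python
import math

def level_resolutions(n_min, n_max, num_levels):
    """Grid resolution per level, growing geometrically from n_min to n_max."""
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (num_levels - 1))
    return [math.floor(n_min * growth ** level) for level in range(num_levels)]

def encoding_parameters(n_min, n_max, num_levels=16, table_size=2 ** 19,
                        num_features=2, dims=3):
    """Trainable values in the encoding: dense where a level fits, capped otherwise."""
    total = 0
    for resolution in level_resolutions(n_min, n_max, num_levels):
        corners = (resolution + 1) ** dims
        total += min(corners, table_size) * num_features
    return total

budget = 16 * 2 ** 19 * 2  # L * T * F
for n_max in (512, 4096, 524288):
    count = encoding_parameters(16, n_max)
    print(n_max, count, count <= budget)
```

Raising the finest resolution a thousandfold never pushes the count past the budget. What changes is how many grid cells share each entry, which is where quality eventually suffers on very large scenes.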

Second, the hash encoding works best for static scenes. Dynamic scenes, where objects move over time, require additional mechanisms. One option is to add a time coordinate to the input, but this increases the complexity of the hash table.

Third, the hash encoding does not generalize across scenes. Each scene requires its own training run. You cannot train a model on one scene and apply it to another. This is true for most neural graphics primitives, but it means the method is not a replacement for traditional graphics pipelines in all cases.

Fourth, the hash encoding is optimized for small networks. If you need a large network for some reason, the hash encoding may not provide as much benefit. The authors designed their approach specifically for the regime where the network is the bottleneck.

Why This Matters Beyond Graphics

The hash encoding is a specific solution to a specific problem, but it illustrates a general principle that applies across machine learning: the choice of input representation can matter more than the size of the network.

For years, the trend in deep learning was to build bigger and bigger models. GPT-3 had 175 billion parameters. The largest vision models had billions of parameters. The assumption was that more parameters meant more capacity to learn complex functions.

Müller and his team showed that you can achieve the same or better results with a tiny network if you give it the right input representation. The hash table is a learned encoding that adapts to the data. It is not a fixed transformation like a Fourier encoding or a positional encoding. It is optimized during training to minimize the loss.

This insight has implications beyond graphics. Any problem that involves learning a continuous function over a domain can benefit from a learned encoding. This includes physics simulation, weather modeling, medical imaging, and scientific computing. The hash encoding is already being used in these fields.

The Open Questions That Remain

The hash encoding paper solved a practical problem, but it also raised theoretical questions that researchers are still exploring.

Why does the multiresolution structure work so well? The authors showed empirically that it works, but they did not provide a rigorous mathematical explanation. The hash table is essentially a sparse representation of the scene. The multiresolution structure ensures that the representation is both coarse and fine grained. But the exact mechanism by which the network resolves hash collisions is not fully understood.

Can the hash encoding be made differentiable in a more principled way? Gradients reach the stored feature vectors, but the hashing and indexing themselves are discrete and are never trained. There may be better ways to do this that could improve training stability or convergence speed.

Is there a theoretical limit to how small the network can be? The authors used a network with 2 layers and 64 neurons. Could you use a network with 1 layer and 10 neurons? Probably not, but the lower bound is unknown.

What This Actually Means

Here are the direct takeaways from the paper, stripped of hype and translated into practical insights.

▸The fix for slow neural graphics is not a bigger network. It is a better input encoding. If you want faster training, do not add more layers. Change how you represent the input. The hash encoding makes the input representation learnable and efficient, which lets the network shrink and contributes to speedups of 1000x or more.

▸Hash collisions are not a bug. They are a feature. The multiresolution structure means that collisions at fine resolutions are resolved by information from coarser resolutions. This is a clever way to compress the representation without losing quality. You can use a small hash table and still represent complex scenes.

▸The method is general purpose. The paper applies it to gigapixel images, signed distance functions, neural radiance caching, and neural radiance fields. It can be applied to any problem where you need to represent a continuous function over a spatial or temporal domain. If you are working on coordinate based neural networks, try the hash encoding.

▸The implementation matters as much as the idea. The authors did not just propose a new encoding. They implemented it using fully fused CUDA kernels that minimize memory bandwidth and compute overhead. The speedup comes from both the algorithmic innovation and the engineering optimization. Do not underestimate the importance of good engineering.

▸The paper is a model of clear thinking. The authors identified a specific bottleneck, designed a simple solution, and validated it with rigorous experiments. They did not add unnecessary complexity. They did not claim more than they could prove. The paper is worth reading for the clarity of its argument alone.

The hash encoding paper is a reminder that sometimes the biggest breakthroughs come from the simplest ideas. A hash table. A small network. A few seconds of training. And suddenly, a field that was stuck in the research lab becomes a practical tool.

That is the kind of progress that changes how we build things. And it started with four researchers who asked a simple question: what if the network did not have to be the smart part?

References

[1] Thomas Müller, Alex Evans, Christoph Schied, Alexander Keller (2022). Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics.
