Render blog by Chris Hellmuth

GPU-Motunui

Disney Animation’s Moana island dataset is a production-scale scene with memory requirements that make it challenging to render. This post summarizes some of those challenges, and describes how the GPU-Motunui project is able to efficiently render the scene on a consumer-grade GPU with less than 8GB of memory. Click here to skip ahead to the results.

The Moana island

In 2018, Disney Animation released the Moana island dataset to the rendering research community. Compared to traditional research scenes, the scale of the Moana island scene is massive: the scene contains 90 million quad primitives, 5 million curves, and more than 28 million instances. All told, the island consists of over 15 billion primitives, weighing in at just under 30GB of geometry files.

The shots included with the dataset are beautiful, and showcase the amazing imagery that can be created by combining the best artists in the world with path tracing techniques and modern hardware. Here are two reference images, rendered with Disney’s proprietary Hyperion renderer:

Hyperion shotCam reference
Hyperion shotCam reference
Hyperion beachCam reference
Hyperion beachCam reference

GPU-Motunui Project

The goal of the GPU-Motunui project is to render all the Moana shots efficiently and accurately on a consumer-grade graphics card. There are two main challenges to accomplishing this with the Moana dataset. First, with a typical graphics card having only 8GB of memory, an out-of-core rendering solution is required to handle the large amounts of geometry. Second, the scene’s textures are provided in the Ptex format, and Ptex doesn’t have a publicly available CUDA implementation. This project currently only solves the first problem, and Ptex texture lookup is done on the CPU (although conveniently its cost is fully hidden by being computed concurrently with GPU shadow ray tracing).

The Hyperion reference images are impossible to match exactly; for example the varying brown and green colors along the palm tree fronds in the palmsCam shot are not provided in the dataset. Other features of the scene are possible to render but out of my initial scope, notably subdivision surfaces and their displacement maps, and a full Disney BSDF implementation.

Example of an unreproducible material variation on the palm tree frond
Example of an unreproducible material variation on the palm tree frond

All ray tracing operations are run through Nvidia’s OptiX 7 API. This means GPU-Motunui gets the full benefits of available RT cores and a world-class BVH implementation. The following sections describe how GPU-Motunui maps dataset assets to OptiX data structures, and how GPU-Motunui’s out-of-core rendering solution works.

Scene representation

The Moana scene makes widespread use of multi-level instancing. In OptiX, this requires a three-level hierarchy of acceleration structures to manage: two levels of IASs, and a base level of GASs (Instance Acceleration Structures and Geometry Acceleration Structures, respectively). GPU-Motunui makes use of OptiX’s AS compaction and relocation APIs to further reduce memory usage.

The isHibiscus element makes a good example of how a typical element in the scene is organized and built. The tree is assembled from a base model in one Wavefront .obj file (containing the trunk and branches), and four primitives: one flower and three leaf models (each with their own .obj file).

Left: The four simple primitives that will be instanced to fill out the hibiscus tree
Right: The base trunk and branches model

In OptiX, each of these models has an associated GAS, and each GAS can be subdivided into multiple build inputs. Build inputs are used to map sections of the model to information needed at shading time by indexing into OptiX’s shader binding table. These GASs form the bottom level of the hierarchy.

Next, an IAS is used to build the full isHibiscus element. This IAS is in the middle level of the hierarchy. The figure below shows each primitive’s instances in isolation, and combined to make the full element:

Left: Isolated instances for each primitive
Right: Full isHibiscus element

Finally, a second IAS is built to track all of the element’s instances present in the scene. This second IAS is the top level of the instance hierarchy.

The shotCam view rendered with only isHibiscus instances
The shotCam view rendered with only isHibiscus instances

Although the isHibiscus element has a typical structure, there are some more complicated elements included in the dataset. The isCoral element, for example, has different base geometry and instanced primitives for each of its element instances, but the underlying primitive geometries are shared across all the element instances.

The Moana GAS and IASs alone require 18.5 GB, well past the memory budget of my 8GB RTX 2070. Because OptiX has no native support for out-of-core rendering, the traditional OptiX pipeline had to be put aside for a custom-made solution.

Out-of-core rendering

To solve the out-of-core rendering problem, GPU-Motunui divides the scene’s geometry into different sections, and ray traces each separately, while tracking the closest hit. Replacing a traditional device trace call with a host loop comes with many design consequences to the renderer, from asset loading to the core path tracing loop that sends rays through the scene.

Before rendering, the asset loading process allocates a large chunk of GPU memory (currently 6.7GB). A custom allocator is implemented that manages this block of memory. It is responsible for allocating two types of memory: output and temporary. Output memory is allocated from the left of the block, and is used for OptiX structures. Temporary memory is managed on a stack from the right end of the memory block. Managing the temporary memory this way ensures that the output structures are always tightly packed.

After elements are processed into their accelerator structures on the GPU, their used memory is snapshotted onto the host, and the allocator is cleared. The process is repeated until all of the scene’s geometry is processed, resulting in the host managing a list of GPU memory snapshots. The figure below shows an example layout of GPU memory that could be snapshotted:

GPU memory layout after loading the isHibiscus element.
(Dotted arrows show that an IAS holds instances of the pointed-at AS)

As mentioned above, when it comes time to ray trace, each snapshot is processed in a loop. This means a call to cudaMemcpy and optixLaunch for each snapshot. A global buffer is maintained that indicates the depth of the current closest intersection. This value is used as the tmax parameter for the CUDA kernel’s call to optixTrace, and a successful intersection will update the depth buffer for the next launch.

In a traditional OptiX path tracer, the entire render loop can run in device code inside a single call to optixLaunch; i.e., a successful intersection will lead to more BSDF and shadow rays being traced in the same kernel launch. Because GPU-Motunui’s design mandates multiple launches for tracing each path segment, the render loop is pulled out into host code. While this potentially diminishes OptiX’s ability to efficiently schedule program execution, it also opens up opportunties for optimization, such as running Ptex texture lookups on the CPU concurrently with GPU kernels and I/O.

Shading

As with any OptiX application, GPU-Motunui makes use of the shader binding table (SBT). SBT records contain pointers to normal buffers and material attributes. The underlying data for the normal buffers is stored alongside OptiX acceleration structures and included in geometry snapshots. This ensures that GPU memory is never wasted on unreachable normal buffer data.

Renders

Included below are GPU-Motunui renders of the six scenes included in the dataset. shotCam is the slowest to render at 18.2 seconds per sample at 1024x429 resolution, and took just over five hours total for the final image. All shots are 1024spp, capped at a maximum of five bounces, and were run on an Nvidia RTX 2070.

shotCam
shotCam
beachCam
beachCam
dunesACam
dunesACam
palmsCam
palmsCam
birdseyeCam
birdseyeCam
rootsCam
rootsCam
grassCam
grassCam

Optimization

The initial implementation of the renderer required 42.6 seconds per 1spp on the shotCam scene. A few optimizations combined to make significant reductions in rendering time, cutting each pass down to 18.2 seconds (a 57.3% reduction).

CPU/GPU concurrency

Tracing shadow rays on the GPU in parallel with Ptex lookups on the CPU cut rendering time by 23.4%. It was disappointing to be forced to do texture lookups on the CPU, but the time savings make up for it.

Multiple Ptex caches

Parallelizing the Ptex lookups and using multiple Ptex caches eliminated texture lookups as a bottleneck to the system; shadow ray casting time fully dominates the texture lookup. Empirically, spawning two threads per core (totaling 12 on an Intel i7-8700K) and sharing three Ptex caches comfortably reduced the texture lookup time beneath the shadow ray budget. This improved the time savings to a 33.9% reduction over the baseline.

Pinned memory

The acceleration structure snapshots are all saved to pinned host memory. Switching from normal to pinned host memory increased the transfer throughput from 7.73 GB/s to 11.84 GB/s, cutting the baseline render time by 19.5%.

Future Steps

Getting this scene running on my RTX 2070 card was a very fun and rewarding project, but there are still many improvements to be made:

  • Implementing the Disney BSDF
  • Rendering subdivision surfaces along with displacement mapping
  • More efficiently packing the acceleration structures, and optimizing ray tracing throughput
  • Experimenting with how various research results hold up on production scenes (e.g., testing select path guiding techniques)

References