We propose to study and implement the parallelization of Bounding Volume Hierarchies (BVH), a core spatial acceleration data structure used in high-performance rendering and geometric queries. Our project’s main focus is parallel BVH construction and efficient BVH traversal, which introduce irregular workloads, data dependencies, and memory access patterns that can limit scalability.
Ray tracing will serve as the baseline application for evaluating how BVH design choices impact end-to-end performance. We will explore and compare parallelization strategies for BVH construction and traversal across sequential CPU, OpenMP multi-core CPU, and CUDA GPU implementations, analyzing tradeoffs in build time, traversal time, load balancing, and memory behavior across ray tracing scenes of increasing complexity.
Ray tracing is a rendering algorithm that simulates light by casting rays from a camera into a scene and evaluating light interactions. For each pixel, multiple rays are traced and their contributions averaged to produce the final image. The major computational bottleneck is ray-scene intersection testing: naively checking every ray against every triangle is O(N) per ray, which is infeasible for complex geometry.
A BVH is a tree-based acceleration structure that reduces this intersection-testing cost to O(log N) per ray. Each internal node stores an axis-aligned bounding box enclosing all geometry in its subtree, so during traversal a ray can skip entire subtrees whose bounding volumes it misses, greatly reducing the number of intersection tests.
BVH construction and traversal jointly determine overall performance. While traversal is embarrassingly parallel at the ray level using OpenMP or CUDA, it introduces irregular control flow and memory access patterns. BVH construction, on the other hand, has hierarchical dependencies that make parallelization even more challenging.
```
function traverse(ray, node):
    if ray misses node.bbox:
        return NO_HIT                      // skip entire subtree
    if node is leaf:
        return intersect(ray, node.triangles)
    hit_left  = traverse(ray, node.left)
    hit_right = traverse(ray, node.right)
    return closest(hit_left, hit_right)
```
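The `ray misses node.bbox` check above is typically a slab test against the node's axis-aligned bounding box. A minimal C++ sketch, with illustrative struct and function names (not from any existing codebase):

```cpp
#include <algorithm>

// Illustrative types for the sketch.
struct Vec3 { float x, y, z; };
struct AABB { Vec3 lo, hi; };
struct Ray  { Vec3 o, d; };  // origin and direction

// Slab test: the ray hits the box iff the per-axis entry/exit
// intervals [t0, t1] all overlap. Zero direction components give
// +/-infinity under IEEE arithmetic, which the test handles.
bool hitsBox(const Ray& r, const AABB& b, float tMin, float tMax) {
    const float o[3]  = { r.o.x,  r.o.y,  r.o.z  };
    const float d[3]  = { r.d.x,  r.d.y,  r.d.z  };
    const float lo[3] = { b.lo.x, b.lo.y, b.lo.z };
    const float hi[3] = { b.hi.x, b.hi.y, b.hi.z };
    for (int a = 0; a < 3; ++a) {
        float inv = 1.0f / d[a];
        float t0 = (lo[a] - o[a]) * inv;
        float t1 = (hi[a] - o[a]) * inv;
        if (inv < 0.0f) std::swap(t0, t1);
        tMin = std::max(tMin, t0);
        tMax = std::min(tMax, t1);
        if (tMax < tMin) return false;  // intervals disjoint: miss, skip subtree
    }
    return true;
}
```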
```
function buildBVH(primitives):
    if primitives.size() == 1:
        return leaf_node(primitives)
    compute bounding box for primitives
    choose split axis (e.g., largest extent)
    sort primitives along chosen axis
    split primitives into left and right sets
    left_child  = buildBVH(left set)
    right_child = buildBVH(right set)
    return node(left_child, right_child, bounding box)
```
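This construction pseudocode can be sketched in C++ as a simple median-split builder. All names here are illustrative, primitives carry only a centroid for brevity, and we use a small leaf threshold (two primitives) rather than exactly one:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Prim { float c[3]; };          // centroid only, for brevity
struct Box  { float lo[3], hi[3]; };

struct Node {
    Box box;
    std::unique_ptr<Node> left, right;  // null for leaves
    std::vector<Prim> prims;            // non-empty only at leaves
};

// Bounding box over primitive centroids.
static Box boundsOf(const std::vector<Prim>& ps) {
    Box b{{1e30f, 1e30f, 1e30f}, {-1e30f, -1e30f, -1e30f}};
    for (const Prim& p : ps)
        for (int a = 0; a < 3; ++a) {
            b.lo[a] = std::min(b.lo[a], p.c[a]);
            b.hi[a] = std::max(b.hi[a], p.c[a]);
        }
    return b;
}

std::unique_ptr<Node> buildBVH(std::vector<Prim> ps) {
    auto n = std::make_unique<Node>();
    n->box = boundsOf(ps);
    if (ps.size() <= 2) {               // small leaf threshold
        n->prims = std::move(ps);
        return n;
    }
    // Choose the axis of largest extent, sort, and split at the median.
    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (n->box.hi[a] - n->box.lo[a] > n->box.hi[axis] - n->box.lo[axis])
            axis = a;
    std::sort(ps.begin(), ps.end(),
              [axis](const Prim& x, const Prim& y) { return x.c[axis] < y.c[axis]; });
    auto mid = ps.begin() + static_cast<std::ptrdiff_t>(ps.size() / 2);
    n->left  = buildBVH(std::vector<Prim>(ps.begin(), mid));
    n->right = buildBVH(std::vector<Prim>(mid, ps.end()));
    return n;
}
```

The recursive calls on the left and right halves are exactly the dependency that makes parallel construction hard: each split must finish before its children can start.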
This project has two primary challenges:
Data dependencies in BVH Construction
Traditional BVH construction is sequential in nature: each spatial split must be computed before its child nodes can be built, creating dependencies across tree levels. We will have to design parallelization strategies, explore alternative algorithms, or restructure the problem to expose parallelism while maintaining the correctness of the structure.
Irregular Parallelism in BVH Traversal
While rays can be processed independently, traversal introduces irregular control flow. Threads in the same GPU warp may follow different paths through the BVH, leading to warp divergence and reduced utilization. Traversal also involves pointer-chasing memory accesses with poor locality.
We will also implement ray-level parallelism, which is straightforward and serves as a comparison point before the BVH is in place.
Other challenges:
- Load imbalance across rays and geometry
- Memory layout
- Tradeoffs between construction and traversal efficiency
We will develop and benchmark our systems across both CPU and GPU platforms to evaluate how different architectures handle irregular parallel workloads such as BVH construction and traversal.
CPU Platforms (GHC Machines & PSC Bridges 2)
Multi-core CPUs using OpenMP for parallelization. This will let us analyze parallelism, load balancing, scalability, and tradeoffs across cores.
GPU Platforms
GHC Machines: NVIDIA RTX 2080 GPUs for primary development and testing
PSC Bridges 2: (If Possible) NVIDIA V100 GPUs for large scale benchmarking
We will use CPUs to study task parallelism and GPUs to analyze data-parallel execution under SIMT, focusing on challenges such as memory access patterns, load imbalance, and warp divergence.
Plan to Achieve
- Sequential CPU ray tracer (baseline for this project)
- OpenMP parallelization for rendering (per ray)
- Sequential BVH construction on CPU
- CUDA-based BVH traversal
- Exploration and research of parallel BVH construction strategies
- If viable strategies are found, implement them for parallel BVH construction and traversal
- Performance analysis of build time, traversal time, speedup vs threads/processors
- Benchmarking across multiple pre-selected scenes such as the Stanford bunny and dragon
- End-to-end comparison of sequential CPU vs OpenMP vs GPU (CUDA)
- Analysis: construction time, traversal time, warp efficiency, memory bandwidth on GHC RTX 2080
Hope to Achieve
- Improve traversal efficiency, e.g., via ray grouping or sorting
- Optimize memory layout for better locality
- Reduce warp divergence to improve GPU utilization
- Interactive or real-time rendering demo comparing all implementations
Fallback Plan
- Use a CPU-built BVH and focus on parallelizing BVH traversal
- Perform analysis of divergence, memory behavior, scaling and most importantly tradeoffs
- Deliver a proper CPU vs GPU comparison as well as a performance study
We chose CUDA on NVIDIA GPUs for this project because ray tracing exposes massive parallelism at the ray level, allowing many rays to be processed simultaneously. We will also use:
- C++ for CPU and OpenMP implementations
- CUDA for GPU acceleration
We chose these to compare how different parallel execution models handle irregular workloads such as BVH construction and traversal.
| Week | Planned Work |
|---|---|
| Week 1 | Read research papers as well as any other relevant material on ray tracing and BVHs. Also understand how to implement sequential BVH construction and traversal. Then understand and check parallelization strategies for CPU and GPU. Finally, create a design with pseudocode. |
| Week 2 | Implement and test baseline including ray tracing and BVH construction/traversal. Also add test scenes of varying complexity and validate correctness against brute-force. |
| Week 3 | Implement OpenMP-based parallelization on CPU for rendering and, if possible, BVH components. We will then evaluate performance, add test cases as needed, and identify bottlenecks. We will also begin drafting the milestone report with our initial results and challenges. |
| Week 4 | Create a CUDA-based GPU implementation with an emphasis on ray tracing integration and BVH traversal. If we can find any other parallelization strategies we will implement those and then verify accuracy, compare to the CPU baseline, and start profiling for divergence, memory behavior, scalability. |
| Week 5 | Finalize GPU implementation and if possible improve the BVH construction parallelization more. We will then collect benchmark results, analyze tradeoffs and then work on the final report and demo. |
Here are some general references we found helpful for understanding BVHs and ray tracing:
[1] P. Shirley, Ray Tracing in One Weekend, 2016.
https://raytracing.github.io/books/RayTracingInOneWeekend.html

[2] M. Pharr, W. Jakob, and G. Humphreys, Physically Based Rendering: From Theory to Implementation, 3rd ed., Morgan Kaufmann, 2016.
https://cw.fel.cvut.cz/b221/_media/courses/b4m39rso/lectures/physically_based_rendering_third_edition.pdf

[3] T. Karras, “Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees,” in Proceedings of High Performance Graphics (HPG), 2012.
https://research.nvidia.com/sites/default/files/pubs/2012-06_Maximizing-Parallelism-in/karras2012hpg_paper.pdf

[4] T. Karras and T. Aila, “Fast Parallel Construction of High-Quality Bounding Volume Hierarchies,” in Proceedings of High Performance Graphics (HPG), 2013.
https://research.nvidia.com/sites/default/files/pubs/2013-07_Fast-Parallel-Construction/karras2013hpg_paper.pdf