We propose to study and implement the parallelization of Bounding Volume Hierarchies (BVH), a core spatial acceleration data structure used in high-performance rendering and geometric queries. Our project’s main focus is parallel BVH construction and efficient BVH traversal, which introduce irregular workloads, data dependencies, and memory access patterns that can limit scalability.
Ray tracing will serve as the baseline application for evaluating how BVH design choices impact end-to-end performance. We will explore and compare parallelization strategies for BVH construction and traversal across sequential CPU, OpenMP multi-core CPU, and CUDA GPU implementations, analyzing tradeoffs in build time, traversal time, load balancing, and memory behavior across ray tracing scenes of increasing complexity.
Ray tracing is a rendering algorithm that simulates light by casting rays from a camera into a scene and evaluating light interactions. For each pixel, multiple rays are traced and their contributions averaged to produce the final image. The major computational bottleneck is ray-scene intersection testing: naively checking every ray against every triangle is O(N) per ray, which is infeasible for complex geometry.
A BVH is a tree-based acceleration structure that reduces this intersection-testing cost to O(log N) per ray. Each internal node stores an axis-aligned bounding box enclosing all geometry in its subtree, so during traversal a ray can skip entire subtrees whose bounding volumes it misses, greatly reducing the number of intersection tests.
BVH construction and traversal jointly determine overall performance. While traversal is embarrassingly parallel at the ray level using OpenMP or CUDA, it introduces irregular control flow and memory access patterns. BVH construction, on the other hand, has hierarchical dependencies that make parallelization even more challenging.
```
function traverse(ray, node):
    if ray misses node.bbox:
        return NO_HIT                      // skip entire subtree
    if node is leaf:
        return intersect(ray, node.triangles)
    hit_left  = traverse(ray, node.left)
    hit_right = traverse(ray, node.right)
    return closest(hit_left, hit_right)
```
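The `ray misses node.bbox` check above is typically a slab test against the node's axis-aligned bounding box. A minimal C++ sketch, with illustrative struct and function names (not from any existing codebase):

```cpp
#include <algorithm>

// Illustrative types for the sketch.
struct Vec3 { float x, y, z; };
struct AABB { Vec3 lo, hi; };
struct Ray  { Vec3 o, d; };  // origin and direction

// Slab test: the ray hits the box iff the per-axis entry/exit
// intervals [t0, t1] all overlap. Zero direction components give
// +/-infinity under IEEE arithmetic, which the test handles.
bool hitsBox(const Ray& r, const AABB& b, float tMin, float tMax) {
    const float o[3]  = { r.o.x,  r.o.y,  r.o.z  };
    const float d[3]  = { r.d.x,  r.d.y,  r.d.z  };
    const float lo[3] = { b.lo.x, b.lo.y, b.lo.z };
    const float hi[3] = { b.hi.x, b.hi.y, b.hi.z };
    for (int a = 0; a < 3; ++a) {
        float inv = 1.0f / d[a];
        float t0 = (lo[a] - o[a]) * inv;
        float t1 = (hi[a] - o[a]) * inv;
        if (inv < 0.0f) std::swap(t0, t1);
        tMin = std::max(tMin, t0);
        tMax = std::min(tMax, t1);
        if (tMax < tMin) return false;  // intervals disjoint: miss, skip subtree
    }
    return true;
}
```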
```
function buildBVH(primitives):
    if primitives.size() == 1:
        return leaf_node(primitives)
    compute bounding box for primitives
    choose split axis (e.g., largest extent)
    sort primitives along chosen axis
    split primitives into left and right sets
    left_child  = buildBVH(left set)
    right_child = buildBVH(right set)
    return node(left_child, right_child, bounding box)
```
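This construction pseudocode can be sketched in C++ as a simple median-split builder. All names here are illustrative, primitives carry only a centroid for brevity, and we use a small leaf threshold (two primitives) rather than exactly one:

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Prim { float c[3]; };          // centroid only, for brevity
struct Box  { float lo[3], hi[3]; };

struct Node {
    Box box;
    std::unique_ptr<Node> left, right;  // null for leaves
    std::vector<Prim> prims;            // non-empty only at leaves
};

// Bounding box over primitive centroids.
static Box boundsOf(const std::vector<Prim>& ps) {
    Box b{{1e30f, 1e30f, 1e30f}, {-1e30f, -1e30f, -1e30f}};
    for (const Prim& p : ps)
        for (int a = 0; a < 3; ++a) {
            b.lo[a] = std::min(b.lo[a], p.c[a]);
            b.hi[a] = std::max(b.hi[a], p.c[a]);
        }
    return b;
}

std::unique_ptr<Node> buildBVH(std::vector<Prim> ps) {
    auto n = std::make_unique<Node>();
    n->box = boundsOf(ps);
    if (ps.size() <= 2) {               // small leaf threshold
        n->prims = std::move(ps);
        return n;
    }
    // Choose the axis of largest extent, sort, and split at the median.
    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (n->box.hi[a] - n->box.lo[a] > n->box.hi[axis] - n->box.lo[axis])
            axis = a;
    std::sort(ps.begin(), ps.end(),
              [axis](const Prim& x, const Prim& y) { return x.c[axis] < y.c[axis]; });
    auto mid = ps.begin() + static_cast<std::ptrdiff_t>(ps.size() / 2);
    n->left  = buildBVH(std::vector<Prim>(ps.begin(), mid));
    n->right = buildBVH(std::vector<Prim>(mid, ps.end()));
    return n;
}
```

The recursive calls on the left and right halves are exactly the dependency that makes parallel construction hard: each split must finish before its children can start.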
This project has two primary challenges:
Data dependencies in BVH Construction
Traditional BVH construction is sequential in nature: each spatial split must be computed before its child nodes can be built, creating dependencies across tree levels. We will have to design parallelization strategies, explore alternative algorithms, or restructure the problem to expose parallelism while maintaining the correctness of the structure.
Irregular Parallelism in BVH Traversal
While rays can be processed independently, traversal introduces irregular control flow. Threads in the same GPU warp may follow different paths through the BVH, leading to warp divergence and reduced utilization. Traversal also involves pointer-chasing memory accesses with poor locality.
We will also implement ray-level parallelism, which is straightforward and serves as a comparison point before the BVH is in place.
Other challenges:
- Load imbalance across rays and geometry
- Memory layout
- Tradeoffs between construction and traversal efficiency
We will develop and benchmark our systems across both CPU and GPU platforms to evaluate how different architectures handle irregular parallel workloads such as BVH construction and traversal.
CPU Platforms (GHC Machines & PSC Bridges 2)
Multi-core CPUs using OpenMP for parallelization. This will let us analyze parallelism, load balancing, scalability, and tradeoffs across cores.
GPU Platforms
GHC Machines: NVIDIA RTX 2080 GPUs for primary development and testing
PSC Bridges 2: (If Possible) NVIDIA V100 GPUs for large scale benchmarking
We will use CPUs to study task parallelism and GPUs to analyze data-parallel execution under SIMT, focusing on challenges such as memory access patterns, load imbalance, and warp divergence.
Plan to Achieve
- Sequential CPU ray tracer (baseline for this project)
- OpenMP parallelization for rendering (per ray)
- Sequential BVH construction on CPU
- CUDA-based BVH traversal
- Exploration and research of parallel BVH construction strategies
- If viable strategies are found, implement them for parallel BVH construction and traversal
- Performance analysis of build time, traversal time, speedup vs threads/processors
- Benchmarking across multiple pre-selected scenes such as the Stanford bunny and dragon
- End-to-end comparison of sequential CPU vs OpenMP vs GPU (CUDA)
- Analysis: construction time, traversal time, warp efficiency, memory bandwidth on GHC RTX 2080
Hope to Achieve
- Improve traversal efficiency, e.g., via ray grouping or sorting
- Optimize memory layout for better locality
- Reduce warp divergence to improve GPU utilization
- Interactive or real-time rendering demo comparing all implementations
Fallback Plan
- Use a CPU-built BVH and focus on parallelizing BVH traversal
- Perform analysis of divergence, memory behavior, scaling and most importantly tradeoffs
- Deliver a proper CPU vs GPU comparison as well as a performance study
We chose CUDA on NVIDIA GPUs for this project because ray tracing exposes massive parallelism at the ray level, allowing many rays to be processed simultaneously. We will also use:
- C++ for CPU and OpenMP implementations
- CUDA for GPU acceleration
We chose these to compare how different parallel execution models handle irregular workloads such as BVH construction and traversal.
| Week | Planned Work |
|---|---|
| Week 1 | Read research papers as well as any other relevant material on ray tracing and BVHs. Also understand how to implement sequential BVH construction and traversal. Then understand and check parallelization strategies for CPU and GPU. Finally, create a design with pseudocode. |
| Week 2 | Implement and test baseline including ray tracing and BVH construction/traversal. Also add test scenes of varying complexity and validate correctness against brute-force. |
| Week 3 | Implement OpenMP-based parallelization on CPU for rendering and, if possible, BVH components. We will then evaluate performance, add test cases as needed, and identify bottlenecks. We will also begin drafting the milestone report with our initial results and challenges. |
| Week 4 | Create a CUDA-based GPU implementation with an emphasis on ray tracing integration and BVH traversal. If we can find any other parallelization strategies we will implement those and then verify accuracy, compare to the CPU baseline, and start profiling for divergence, memory behavior, scalability. |
| Week 5 | Finalize GPU implementation and if possible improve the BVH construction parallelization more. We will then collect benchmark results, analyze tradeoffs and then work on the final report and demo. |
Here are some general references we found helpful for understanding BVHs and ray tracing:
[1] P. Shirley, Ray Tracing in One Weekend, 2016.
https://raytracing.github.io/books/RayTracingInOneWeekend.html

[2] M. Pharr, W. Jakob, and G. Humphreys, Physically Based Rendering: From Theory to Implementation, 3rd ed., Morgan Kaufmann, 2016.
https://cw.fel.cvut.cz/b221/_media/courses/b4m39rso/lectures/physically_based_rendering_third_edition.pdf

[3] T. Karras, “Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees,” in Proceedings of High Performance Graphics (HPG), 2012.
https://research.nvidia.com/sites/default/files/pubs/2012-06_Maximizing-Parallelism-in/karras2012hpg_paper.pdf

[4] T. Karras and T. Aila, “Fast Parallel Construction of High-Quality Bounding Volume Hierarchies,” in Proceedings of High Performance Graphics (HPG), 2013.
https://research.nvidia.com/sites/default/files/pubs/2013-07_Fast-Parallel-Construction/karras2013hpg_paper.pdf