15618 Spring 26

Parallel BVH Construction and Traversal for Accelerated Ray Tracing

Project Milestone Report
Names: Soham Narendra Joshi, Viren Dodia
Andrew IDs: sohamnaj, vdodia
Work Completed So Far

We have implemented a ray tracing pipeline that supports both primitive (spheres and triangles) and mesh-based scenes. The code lets us construct scenes of varying complexity, from small ones like 100 random spheres up to datasets such as the Stanford bunny and dragon. We also implemented a correctness validation pipeline behind a "check" feature flag, which compares rendered PPM outputs to ensure consistency, so that we can verify whether our parallel implementations are correct.
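Conceptually, the check boils down to a framebuffer comparison. The sketch below is illustrative (the function name, tolerance, and 8-bit RGB representation are assumptions, not our actual code): two buffers, e.g. decoded from the PPM outputs, are compared per channel with a small tolerance to absorb floating-point ordering differences between implementations.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Illustrative sketch of a "check"-style comparison: two 8-bit RGB
// framebuffers are compared channel by channel, allowing a small tolerance
// for floating-point ordering differences between implementations.
bool images_match(const std::vector<uint8_t>& a,
                  const std::vector<uint8_t>& b, int tol = 1) {
    if (a.size() != b.size()) return false;            // dimension mismatch
    for (size_t i = 0; i < a.size(); ++i)
        if (std::abs(int(a[i]) - int(b[i])) > tol) return false;
    return true;
}
```

An exact byte-for-byte comparison also works when both renderers accumulate in the same order; the tolerance only matters once parallel reduction order changes the low-order bits.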

On top of the renderer, we have implemented sequential BVH construction and traversal. We built the tree with a recursive top-down midpoint-split approach. As this is slow for big scenes like the dragon, we started our parallel implementations. We have implemented OpenMP-based BVH construction using task parallelism for the recursion (left and right children). We also implemented an optimized top-down BVH construction using a range-based (start, end) representation for the BVH tree, which eliminates the vector copies that previously happened at every level of recursion.
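A minimal sketch of the shape of this construction, with simplified hypothetical types (not our actual code): each call works on a (start, end) range of a shared primitive array, partitions it in place at the spatial midpoint of the longest axis, and spawns an OpenMP task for one child only while the range is above a cutoff.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Sketch of range-based, task-parallel top-down midpoint-split BVH build.
// Types and constants are simplified assumptions for illustration.
struct Prim { float c[3]; };                   // primitive reduced to its centroid

struct Node {
    float lo[3], hi[3];                        // bounding box over centroids
    std::unique_ptr<Node> left, right;
    int start = 0, count = 0;                  // leaf range into the shared array
};

static const int LEAF_SIZE   = 4;              // max primitives per leaf
static const int TASK_CUTOFF = 1 << 12;        // below this, recurse serially

std::unique_ptr<Node> build(std::vector<Prim>& prims, int start, int end) {
    auto node = std::make_unique<Node>();
    for (int a = 0; a < 3; ++a) { node->lo[a] = 1e30f; node->hi[a] = -1e30f; }
    for (int i = start; i < end; ++i)
        for (int a = 0; a < 3; ++a) {
            node->lo[a] = std::min(node->lo[a], prims[i].c[a]);
            node->hi[a] = std::max(node->hi[a], prims[i].c[a]);
        }
    int n = end - start;
    if (n <= LEAF_SIZE) { node->start = start; node->count = n; return node; }

    // Partition the range in place at the spatial midpoint of the longest
    // axis -- no per-call vector copies, only (start, end) indices.
    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (node->hi[a] - node->lo[a] > node->hi[axis] - node->lo[axis]) axis = a;
    float mid = 0.5f * (node->lo[axis] + node->hi[axis]);
    auto* split = std::partition(prims.data() + start, prims.data() + end,
                                 [&](const Prim& p) { return p.c[axis] < mid; });
    int m = int(split - prims.data());
    if (m == start || m == end) m = start + n / 2;  // fall back on a median split

    if (n > TASK_CUTOFF) {                     // spawn a task only when the range
        #pragma omp task shared(prims, node)   // is large enough to amortize
        node->left = build(prims, start, m);   // the task-creation overhead
        node->right = build(prims, m, end);
        #pragma omp taskwait
    } else {
        node->left = build(prims, start, m);
        node->right = build(prims, m, end);
    }
    return node;
}

int leaf_prims(const Node* nd) {               // total primitives under a node
    return nd->left ? leaf_prims(nd->left.get()) + leaf_prims(nd->right.get())
                    : nd->count;
}
```

The top-level call is wrapped in `#pragma omp parallel` / `#pragma omp single` so one thread seeds the recursion while the team executes the spawned tasks; the cutoff constant is the granularity knob we plan to tune.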

Progress Toward Goals & Updated Deliverables

Our primary goal is to explore and evaluate different parallelization strategies for BVH construction, understand their scalability limitations, and identify more parallel-friendly construction algorithms.

We are currently on track with respect to our deliverables, as our main goal was to get the base ray tracer and correctness validation done properly so we could use them for our parallelization strategies. With that complete, we have begun implementing multiple parallelization strategies on CPU and GPU, adding optimizations as we make new observations.

As we did not state clearly in our original proposal, our main focus is to discover how to parallelize BVH construction using multiple strategies taught in class, analyze them, and finally optimize them.

We still believe we will be able to deliver all our deliverables (which we updated on our website according to feedback). However, based on our early results, we decided to refine our approach for a staged exploration of BVH parallelization.

Updated Goals
  • Sequential ray tracer and BVH baseline
  • Naive OpenMP BVH construction
  • Optimized range-based OpenMP BVH construction
  • Evaluate task-based parallel BVH construction (which is our current approach)
  • Explore task granularity tuning (threshold-based task spawning; still to be explored)
  • Analyze scaling behavior vs scene size and thread count
  • Compare coarse-grained vs fine-grained parallelism strategies
  • Measure impact of parallel overhead
  • Implement CPU traversal parallelization as well
  • Evaluate independent ray parallelism vs BVH construction parallelism
  • Explore alternative BVH construction strategies (e.g., LBVH); this is important for us, as it might enable much better parallelization
  • Compare the original BVH vs parallel-friendly linear BVH approaches
  • Implement GPU-based BVH traversal and/or construction (CUDA)
  • Implement a GPU version of the alternative strategy as well
  • Evaluate GPU performance factors such as warp divergence and memory behavior
  • Perform a comprehensive performance analysis comparing all the strategies tried and understanding the tradeoffs
Nice-to-haves
  • SIMD optimizations for traversal
  • Improved memory layout for better cache locality
  • Advanced GPU optimizations like reducing divergence, improving memory coalescing
  • Hybrid CPU-GPU strategy for BVH construction plus traversal (if our implementations are complementary to each other, we will try combining them into a hybrid approach)
Updated Schedule
Week 1
Read research papers as well as any other relevant material on ray tracing and BVHs. Also understand how to implement sequential BVH construction and traversal. Then understand and check parallelization strategies for CPU and GPU. Finally, create a design with pseudocode.
Completed
Week 2
Implement and test baseline including ray tracing and BVH construction/traversal. Also add test scenes of varying complexity and validate correctness against brute-force.
Completed
Week 3
Implemented the naive OpenMP BVH build and the improved range-based BVH. Testing CPU parallelization (task-based recursion, granularity, load balancing) and gathering early performance data.
In-progress
Week 4
Explore improved parallelization strategies for BVH construction, including better task granularity, alternative splitting approaches, and LBVH (Morton-code-based construction). Begin GPU implementation aligned with these strategies.
In-progress
Week 5
Finalize the CPU and GPU parallel implementations and attempt the nice-to-haves. We will then collect benchmark results, analyze tradeoffs, and work on the final report and demo.
Planned
Poster Session Plan

At the poster session we plan to present both visual and performance-based results.

  • Rendered outputs of scenes such as the Stanford bunny and dragon
  • Comparison between different parallelization strategies
  • Graphs showing BVH build time vs. thread count, speedup, and a comparison of strategies
  • Discussion of limitations such as scalability, and of optimization strategies
Preliminary Results

Having already implemented two basic parallelization strategies for our BVH, we have early results. The BVH construction timings are as follows:

(All are for 800x450 resolution)

Dragon[2] (870k triangles)

Threads   Naive OMP (ms)   Optimized OMP (ms)
1         43453            43139
2         24322            24019
4         15610            15485
8         11734            11524

Speedup was:

  • Naive: 3.72x for 8 threads
  • Optimized range-based: 3.75x for 8 threads
Stanford bunny[1] (70k triangles)

Threads   Naive OMP (ms)   Optimized OMP (ms)
1         2499             2486
2         1419             1409
4         937              938
8         803              798

Speedup was 3.1x for 8 threads.

Observations
  • Parallelization gives a clear speedup, but scaling is sublinear
  • Efficiency decreases as thread count increases, indicating a bottleneck
  • The optimized range-based BVH reduces memory overhead but does not improve scaling much, since our main limitation is the amount of available parallelism
  • Performance is dominated by recursive sorting/partitioning cost, task overhead, and limited parallelism at the top levels of the BVH tree
  • This indicates that naive parallelization alone is insufficient
Challenges & Open Issues

The main challenge we have encountered is efficient parallelization of BVH construction. While our initial implementations improve performance, scaling is limited by:

  • Task creation overhead
  • Sequential sorting
  • Load imbalance across subtrees
  • Memory and cache inefficiency

We can address this by exploring alternative construction strategies such as LBVH, in which the otherwise sequential upper part of the tree can also be built in parallel. At this point we believe the bottleneck is more algorithmic than purely implementation-based.
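For context, the usual first step of an LBVH-style build is computing Morton codes: each centroid, normalized to the unit cube, is quantized to 10 bits per axis and the bits are interleaved, so one parallel sort by key spatially clusters the primitives and the hierarchy can then be emitted in parallel rather than by top-down recursion. A sketch of the standard bit-interleaving trick (assuming inputs already normalized to [0,1]; illustrative, not code we have written yet):

```cpp
#include <cassert>
#include <cstdint>

// Spread the low 10 bits of v so that bit i lands at bit 3*i; three such
// expanded values interleave into one 30-bit Morton code.
static uint32_t expand_bits(uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a point with coordinates in [0,1].
uint32_t morton3D(float x, float y, float z) {
    uint32_t xi = (uint32_t)(x * 1023.0f);
    uint32_t yi = (uint32_t)(y * 1023.0f);
    uint32_t zi = (uint32_t)(z * 1023.0f);
    return (expand_bits(xi) << 2) | (expand_bits(yi) << 1) | expand_bits(zi);
}
```

Sorting by these keys would replace the per-node recursive partitioning with a single parallel sort, which is exactly where our current scaling is being lost.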

At this point there are no major problems blocking our progress; the remaining work primarily involves trying different parallelization strategies and more efficient BVH construction strategies, then evaluating and understanding the tradeoffs.

Summary

We have successfully completed the baseline implementation and multiple BVH construction strategies. Our initial results show both speedups and the limitations of naive parallelization, motivating further exploration of improved algorithms such as LBVH, other parallelization strategies, and GPU-based implementations. The project is progressing according to plan, and we are on track to complete all remaining deliverables.

Datasets
[1] Stanford Bunny

Source: Stanford University Computer Graphics Laboratory
Link: https://graphics.stanford.edu/data/3Dscanrep/
Scanner: Cyberware 3030 MS
Number of scans: 10
Total size of scans: 362,272 points (about 725,000 triangles)
Reconstruction: zipper
Size of reconstruction: 35947 vertices, 69451 triangles
Comments: contains 5 holes in the bottom

[2] Stanford Dragon

Source: Stanford University Computer Graphics Laboratory
Scanner: Cyberware 3030 MS + spacetime analysis
Number of scans: ~70
Link: https://graphics.stanford.edu/data/3Dscanrep/
Total size of scans: 2,748,318 points (about 5,500,000 triangles)
Reconstruction: vrip (conservatively decimated)
Size of reconstruction: 566,098 vertices, 1,132,830 triangles
Comments: contains numerous small holes