15618 Spring 26

Parallel BVH Construction and Traversal for Accelerated Ray Tracing

Project Milestone Report
Names: Soham Narendra Joshi, Viren Dodia
Andrew IDs: sohamnaj, vdodia
Work Completed So Far

We have implemented a ray tracing pipeline that supports both primitive (spheres and triangles) and mesh-based scenes. The code lets us construct scenes of varying complexity, from small ones like 100 random spheres up to datasets such as the Stanford bunny and dragon. We also implemented a correctness validation pipeline behind a "check" feature flag, which compares rendered PPM outputs to ensure consistency, so that we can verify whether our parallel implementations are correct.
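Conceptually, the check boils down to a framebuffer comparison. The sketch below is illustrative (the function name, tolerance, and 8-bit RGB representation are assumptions, not our actual code): two buffers, e.g. decoded from the PPM outputs, are compared per channel with a small tolerance to absorb floating-point ordering differences between implementations.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Illustrative sketch of a "check"-style comparison: two 8-bit RGB
// framebuffers are compared channel by channel, allowing a small tolerance
// for floating-point ordering differences between implementations.
bool images_match(const std::vector<uint8_t>& a,
                  const std::vector<uint8_t>& b, int tol = 1) {
    if (a.size() != b.size()) return false;            // dimension mismatch
    for (size_t i = 0; i < a.size(); ++i)
        if (std::abs(int(a[i]) - int(b[i])) > tol) return false;
    return true;
}
```

An exact byte-for-byte comparison also works when both renderers accumulate in the same order; the tolerance only matters once parallel reduction order changes the low-order bits.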

On top of the renderer, we have implemented sequential BVH construction and traversal. We built the tree with a recursive top-down midpoint-split approach. As this is slow for big scenes like the dragon, we started our parallel implementations. We have implemented OpenMP-based BVH construction using task parallelism for the recursion (left and right children). We also implemented an optimized top-down BVH construction using a range-based (start, end) representation for the BVH tree, which eliminates the vector copies that previously happened at every level of recursion.
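A minimal sketch of the shape of this construction, with simplified hypothetical types (not our actual code): each call works on a (start, end) range of a shared primitive array, partitions it in place at the spatial midpoint of the longest axis, and spawns an OpenMP task for one child only while the range is above a cutoff.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Sketch of range-based, task-parallel top-down midpoint-split BVH build.
// Types and constants are simplified assumptions for illustration.
struct Prim { float c[3]; };                   // primitive reduced to its centroid

struct Node {
    float lo[3], hi[3];                        // bounding box over centroids
    std::unique_ptr<Node> left, right;
    int start = 0, count = 0;                  // leaf range into the shared array
};

static const int LEAF_SIZE   = 4;              // max primitives per leaf
static const int TASK_CUTOFF = 1 << 12;        // below this, recurse serially

std::unique_ptr<Node> build(std::vector<Prim>& prims, int start, int end) {
    auto node = std::make_unique<Node>();
    for (int a = 0; a < 3; ++a) { node->lo[a] = 1e30f; node->hi[a] = -1e30f; }
    for (int i = start; i < end; ++i)
        for (int a = 0; a < 3; ++a) {
            node->lo[a] = std::min(node->lo[a], prims[i].c[a]);
            node->hi[a] = std::max(node->hi[a], prims[i].c[a]);
        }
    int n = end - start;
    if (n <= LEAF_SIZE) { node->start = start; node->count = n; return node; }

    // Partition the range in place at the spatial midpoint of the longest
    // axis -- no per-call vector copies, only (start, end) indices.
    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (node->hi[a] - node->lo[a] > node->hi[axis] - node->lo[axis]) axis = a;
    float mid = 0.5f * (node->lo[axis] + node->hi[axis]);
    auto* split = std::partition(prims.data() + start, prims.data() + end,
                                 [&](const Prim& p) { return p.c[axis] < mid; });
    int m = int(split - prims.data());
    if (m == start || m == end) m = start + n / 2;  // fall back on a median split

    if (n > TASK_CUTOFF) {                     // spawn a task only when the range
        #pragma omp task shared(prims, node)   // is large enough to amortize
        node->left = build(prims, start, m);   // the task-creation overhead
        node->right = build(prims, m, end);
        #pragma omp taskwait
    } else {
        node->left = build(prims, start, m);
        node->right = build(prims, m, end);
    }
    return node;
}

int leaf_prims(const Node* nd) {               // total primitives under a node
    return nd->left ? leaf_prims(nd->left.get()) + leaf_prims(nd->right.get())
                    : nd->count;
}
```

The top-level call is wrapped in `#pragma omp parallel` / `#pragma omp single` so one thread seeds the recursion while the team executes the spawned tasks; the cutoff constant is the granularity knob we plan to tune.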

Progress Toward Goals & Updated Deliverables

Our primary goal is to explore and evaluate different parallelization strategies for BVH construction, understand their scalability limitations, and identify more parallel-friendly construction algorithms.

We are currently on track with respect to our deliverables, as our main goal was to get the base ray tracer and correctness validation done properly so we could use them for our parallelization strategies. With that complete, we have begun implementing multiple parallelization strategies on CPU and GPU, adding optimizations as we make new observations.

As we did not state clearly in our original proposal, our main focus is to discover how to parallelize BVH construction using multiple strategies taught in class, analyze them, and finally optimize them.

We still believe we will be able to deliver all our deliverables (which we updated on our website according to feedback). However, based on our early results, we decided to refine our approach for a staged exploration of BVH parallelization.

Updated Goals
  • Sequential ray tracer and BVH baseline
  • Naive OpenMP BVH construction
  • Optimized range-based OpenMP BVH construction
  • Evaluate task-based parallel BVH construction (which is our current approach)
  • Explore task granularity tuning (threshold-based task spawning; still to be explored)
  • Analyze scaling behavior vs scene size and thread count
  • Compare coarse-grained vs fine-grained parallelism strategies
  • Measure impact of parallel overhead
  • Implement CPU traversal parallelization as well
  • Evaluate independent ray parallelism vs BVH construction parallelism
  • Explore alternative BVH construction strategies (e.g., LBVH); this is important for us, as it might enable much better parallelization
  • Compare the original BVH vs parallel-friendly linear BVH approaches
  • Implement GPU-based BVH traversal and/or construction (CUDA)
  • Implement a GPU version of the alternative strategy as well
  • Evaluate GPU performance factors such as warp divergence and memory behavior
  • Perform a comprehensive performance analysis comparing all the strategies tried and understanding the tradeoffs
Nice-to-haves
  • SIMD optimizations for traversal
  • Improved memory layout for better cache locality
  • Advanced GPU optimizations like reducing divergence, improving memory coalescing
  • Hybrid CPU-GPU strategy for BVH construction plus traversal (if our implementations are complementary to each other, we will try combining them into a hybrid approach)
Updated Schedule
Week 1
Read research papers as well as any other relevant material on ray tracing and BVHs. Also understand how to implement sequential BVH construction and traversal. Then understand and check parallelization strategies for CPU and GPU. Finally, create a design with pseudocode.
Completed
Week 2
Implement and test baseline including ray tracing and BVH construction/traversal. Also add test scenes of varying complexity and validate correctness against brute-force.
Completed
Week 3
Implemented the naive OpenMP BVH build and the improved range-based BVH. Testing CPU parallelization (task-based recursion, granularity, load balancing) and gathering early performance data.
In-progress
Week 4
Explore improved parallelization strategies for BVH construction, including better task granularity, alternative splitting approaches, and LBVH (Morton-code-based construction). Begin GPU implementation aligned with these strategies.
In-progress
Week 5
Finalize the CPU and GPU parallel implementations and attempt the nice-to-haves. We will then collect benchmark results, analyze tradeoffs, and work on the final report and demo.
Planned
Poster Session Plan

At the poster session we plan to present both visual and performance-based results.

  • Rendered outputs of scenes such as the Stanford bunny and dragon
  • Comparison between different parallelization strategies
  • Graphs showing BVH build time vs. thread count, speedup, and a comparison of strategies
  • Discussion of limitations such as scalability, and of optimization strategies
Preliminary Results

Having already implemented two basic parallelization strategies for our BVH, we have early results. The BVH construction timings are as follows:

(All are for 800x450 resolution)

Dragon[2] (870k triangles)

Threads   Naive OMP (ms)   Optimized OMP (ms)
1         43453            43139
2         24322            24019
4         15610            15485
8         11734            11524

Speedup was:

  • Naive: 3.72x for 8 threads
  • Optimized range-based: 3.75x for 8 threads
Stanford bunny[1] (70k triangles)

Threads   Naive OMP (ms)   Optimized OMP (ms)
1         2499             2486
2         1419             1409
4         937              938
8         803              798

Speedup was 3.1x for 8 threads.

Observations
  • Parallelization gives a clear speedup, but scaling is sublinear
  • Efficiency decreases as thread count increases, indicating a bottleneck
  • The optimized range-based BVH reduces memory overhead but does not improve scaling much, since our main limitation is the amount of available parallelism
  • Performance is dominated by recursive sorting/partitioning cost, task overhead, and limited parallelism at the top levels of the BVH tree
  • This indicates that naive parallelization alone is insufficient
Challenges & Open Issues

The main challenge we have encountered is efficient parallelization of BVH construction. While our initial implementations improve performance, scaling is limited by:

  • Task creation overhead
  • Sequential sorting
  • Load imbalance across subtrees
  • Memory and cache inefficiency

We can address this by exploring alternative construction strategies such as LBVH, in which the otherwise sequential upper part of the tree can also be built in parallel. At this point we believe the bottleneck is more algorithmic than purely implementation-based.
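For context, the usual first step of an LBVH-style build is computing Morton codes: each centroid, normalized to the unit cube, is quantized to 10 bits per axis and the bits are interleaved, so one parallel sort by key spatially clusters the primitives and the hierarchy can then be emitted in parallel rather than by top-down recursion. A sketch of the standard bit-interleaving trick (assuming inputs already normalized to [0,1]; illustrative, not code we have written yet):

```cpp
#include <cassert>
#include <cstdint>

// Spread the low 10 bits of v so that bit i lands at bit 3*i; three such
// expanded values interleave into one 30-bit Morton code.
static uint32_t expand_bits(uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a point with coordinates in [0,1].
uint32_t morton3D(float x, float y, float z) {
    uint32_t xi = (uint32_t)(x * 1023.0f);
    uint32_t yi = (uint32_t)(y * 1023.0f);
    uint32_t zi = (uint32_t)(z * 1023.0f);
    return (expand_bits(xi) << 2) | (expand_bits(yi) << 1) | expand_bits(zi);
}
```

Sorting by these keys would replace the per-node recursive partitioning with a single parallel sort, which is exactly where our current scaling is being lost.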

At this point there are no major problems blocking our progress; the remaining work primarily involves trying different parallelization strategies and more efficient BVH construction strategies, then evaluating and understanding the tradeoffs.

Summary

We have successfully completed the baseline implementation and multiple BVH construction strategies. Our initial results show both speedups and the limitations of naive parallelization, motivating further exploration of improved algorithms such as LBVH, other parallelization strategies, and GPU-based implementations. The project is progressing according to plan, and we are on track to complete all remaining deliverables.

Datasets
[1] Stanford Bunny

Source: Stanford University Computer Graphics Laboratory
Link: https://graphics.stanford.edu/data/3Dscanrep/
Scanner: Cyberware 3030 MS
Number of scans: 10
Total size of scans: 362,272 points (about 725,000 triangles)
Reconstruction: zipper
Size of reconstruction: 35947 vertices, 69451 triangles
Comments: contains 5 holes in the bottom

[2] Stanford Dragon

Source: Stanford University Computer Graphics Laboratory
Scanner: Cyberware 3030 MS + spacetime analysis
Number of scans: ~70
Link: https://graphics.stanford.edu/data/3Dscanrep/
Total size of scans: 2,748,318 points (about 5,500,000 triangles)
Reconstruction: vrip (conservatively decimated)
Size of reconstruction: 566,098 vertices, 1,132,830 triangles
Comments: contains numerous small holes