We have implemented a ray tracing pipeline that supports both primitive-based (spheres and triangles) and mesh-based scenes. The code is structured so that we can construct multiple scenes of varying complexity, from small scenes such as 100 random spheres up to mesh datasets such as the Stanford Bunny and Dragon. We also implemented a correctness validation pipeline behind a "check" feature flag, which compares rendered PPM outputs against a reference to verify whether our parallel implementations are correct or not.
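The idea behind the "check" flag can be sketched as a byte-for-byte comparison of a candidate PPM render against a sequential reference render. The helper name below is illustrative, not the exact function in our codebase:

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Compare two PPM files byte for byte. Render once with the sequential
// reference path and once with the parallel path, then call this to
// verify the outputs are identical. (Hypothetical helper for illustration.)
bool ppm_identical(const std::string& ref_path, const std::string& test_path) {
    std::ifstream a(ref_path, std::ios::binary);
    std::ifstream b(test_path, std::ios::binary);
    if (!a || !b) return false;  // a missing file counts as a mismatch
    std::vector<char> ref((std::istreambuf_iterator<char>(a)),
                          std::istreambuf_iterator<char>());
    std::vector<char> out((std::istreambuf_iterator<char>(b)),
                          std::istreambuf_iterator<char>());
    return ref == out;
}
```

An exact comparison works here because the parallel renderer only reorders work, not arithmetic; a renderer that changed floating-point summation order would need a per-pixel tolerance instead.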
On top of the renderer, we implemented sequential BVH construction and traversal, using a top-down mid-split recursive approach to build the tree. Since this is slow for big scenes like the Dragon, we started our parallel implementations. We implemented OpenMP-based BVH construction using task parallelism for the left and right recursive calls. We also implemented an optimized top-down BVH construction that uses a range-based (start, end) representation for the BVH tree, eliminating the vector copying that previously happened at every level of recursion.
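A minimal sketch of the range-based, task-parallel variant is below. Centroids are reduced to one float per primitive for brevity (the real tree stores bounding boxes), and all names are illustrative rather than our actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Range-based top-down mid-split build: each recursive call owns an index
// range [start, end) of one shared, partitioned-in-place centroid array,
// so no sub-vectors are copied. The left child is built as an OpenMP task,
// mirroring the task parallelism over the left/right recursion.
struct Node {
    std::size_t start, end;   // primitive range covered by this node
    Node* left = nullptr;
    Node* right = nullptr;
};

Node* build(std::vector<float>& cent, std::size_t start, std::size_t end) {
    Node* node = new Node{start, end};
    if (end - start <= 2) return node;           // small range: make a leaf

    // Mid-split: partition around the midpoint of the centroid bounds.
    auto [lo, hi] = std::minmax_element(cent.begin() + start, cent.begin() + end);
    float mid = 0.5f * (*lo + *hi);
    auto pivot = std::partition(cent.begin() + start, cent.begin() + end,
                                [mid](float c) { return c < mid; });
    std::size_t m = static_cast<std::size_t>(pivot - cent.begin());
    if (m == start || m == end) m = start + (end - start) / 2;  // degenerate split

    #pragma omp task shared(cent)                // left half runs as a task
    node->left = build(cent, start, m);
    node->right = build(cent, m, end);           // right half on this thread
    #pragma omp taskwait
    return node;
}
```

The root call would be wrapped in `#pragma omp parallel` / `#pragma omp single`; without OpenMP the pragmas are ignored and the build runs sequentially. Node cleanup is omitted for brevity.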
Our primary goal is to explore and evaluate different parallelization strategies for BVH construction, understand their scalability limitations, and try to identify more parallel-friendly construction algorithms.
We are currently on track with respect to our deliverables, as our main goal was to get the base ray tracer and correctness validation working properly so we could build our parallelization strategies on top of them. With all of this done, we have begun implementing multiple parallelization strategies on CPU and GPU, adding optimizations as we make new observations.
As we did not state clearly in our original proposal report, our main focus is to discover how we can parallelize BVH construction using multiple strategies taught in class, analyze them, and finally optimize them.
We still believe we will be able to deliver all our deliverables (which we updated on our website according to feedback). However, based on our early results, we decided to refine our approach into a staged exploration of BVH parallelization.
- Sequential ray tracer and BVH baseline
- Naive OpenMP BVH construction
- Optimized range-based OpenMP BVH construction
- Evaluate task-based parallel BVH construction (which is our current approach)
- Explore task granularity tuning (threshold-based task spawning, still need to explore)
- Analyze scaling behavior vs scene size and thread count
- Compare coarse-grained vs fine-grained parallelism strategies
- Measure impact of parallel overhead
- Implement CPU traversal parallelization too
- Evaluate independent ray parallelism vs BVH construction parallelism
- Explore alternative BVH construction strategies (e.g., LBVH); this is important for us, as it may enable much better parallelization
- Compare the original BVH against parallel-friendly linear BVH approaches
- Implement GPU-based BVH traversal and/or construction (CUDA)
- Implement a GPU version of the new alternative strategy as well
- Evaluate GPU performance factors such as warp divergence and memory behavior
- Perform comprehensive performance analysis comparing all of the tried strategies and understanding the tradeoffs.
- SIMD optimizations for traversal
- Improved memory layout for better cache locality
- Advanced GPU optimizations like reducing divergence, improving memory coalescing
- Hybrid CPU-GPU strategy for BVH construction plus traversal (if our implementations are complementary, we will try combining them into a hybrid approach)
At the poster session, we are planning to present both visual and performance-based results:
- Rendered outputs of scenes such as the Stanford Bunny and Dragon
- Comparison between different parallelization strategies
- Graphs showing BVH build time vs. thread count, speedup, and comparison of different strategies
- Discussion of limitations like scalability and optimization strategies
As we have already implemented two basic parallelization strategies for our BVH, we have initial results. The BVH construction timings are as follows (all measured at 800x450 resolution).

Larger scene:
| Threads | Naive OMP (ms) | Optimized OMP (ms) |
|---|---|---|
| 1 | 43453 | 43139 |
| 2 | 24322 | 24019 |
| 4 | 15610 | 15485 |
| 8 | 11734 | 11524 |
Speedup was:
- Naive: 3.72x for 8 threads
- Optimized range-based: 3.75x for 8 threads
Smaller scene:

| Threads | Naive OMP (ms) | Optimized OMP (ms) |
|---|---|---|
| 1 | 2499 | 2486 |
| 2 | 1419 | 1409 |
| 4 | 937 | 938 |
| 8 | 803 | 798 |
Speedup was 3.1x for 8 threads.
- Parallelization does give us a clear speedup, but scaling is sublinear
- Efficiency decreases as thread count increases, indicating a bottleneck
- The optimized range-based BVH reduces memory overhead but does not improve scaling much, because the main limiter is the amount of available parallelism rather than memory traffic
- Performance is dominated by recursive sorting cost, task overhead, and the limited parallelism at the top levels of the BVH tree
- This indicates that naive parallelization alone is insufficient.
The main challenge we have encountered is efficient parallelization of BVH construction. While our initial implementations improve performance, scaling is limited by:
- Task creation overhead
- Sequential sorting
- Load imbalance across subtrees
- Memory and cache inefficiency
We can pursue a more parallel-friendly BVH by exploring alternative construction strategies such as LBVH, so that the currently sequential parts of tree construction can themselves be parallelized. At this point, we know the bottleneck is more algorithmic than purely implementation-based.
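The core of the LBVH idea is to replace recursive splitting with a sort over Morton codes, which interleave the bits of quantized centroid coordinates; both the sort and the subsequent hierarchy emission parallelize well. A sketch of the standard 30-bit Morton encoding used in Karras-style LBVH construction (our eventual implementation may differ):

```cpp
#include <algorithm>
#include <cstdint>

// Expand a 10-bit integer so its bits occupy every third position.
std::uint32_t expand_bits(std::uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a centroid with coordinates in [0, 1].
// Sorting primitives by this key orders them along a Z-order curve,
// turning BVH construction into a sort plus a parallel hierarchy pass.
std::uint32_t morton3d(float x, float y, float z) {
    x = std::min(std::max(x * 1024.0f, 0.0f), 1023.0f);
    y = std::min(std::max(y * 1024.0f, 0.0f), 1023.0f);
    z = std::min(std::max(z * 1024.0f, 0.0f), 1023.0f);
    return (expand_bits((std::uint32_t)x) << 2) |
           (expand_bits((std::uint32_t)y) << 1) |
            expand_bits((std::uint32_t)z);
}
```

Because every primitive's code is computed independently and the sort is a standard parallel primitive, this removes the sequential top-of-tree bottleneck that limits our current recursive build, at some cost in tree quality.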
At this point there are no major problems blocking our progress; the remaining work primarily involves trying different parallelization strategies and more efficient BVH construction strategies, then evaluating and understanding the tradeoffs.
We have successfully completed the baseline implementation and multiple BVH construction strategies. Our initial results show clear speedups as well as the limitations of naive parallelization, which motivates further exploration of improved algorithms such as LBVH, other parallelization strategies, and GPU-based implementations. The project is progressing according to plan, and we are on track to complete all remaining deliverables.
Stanford Bunny:
Source: Stanford University Computer Graphics Laboratory
Link: https://graphics.stanford.edu/data/3Dscanrep/
Scanner: Cyberware 3030 MS
Number of scans: 10
Total size of scans: 362,272 points (about 725,000 triangles)
Reconstruction: zipper
Size of reconstruction: 35947 vertices, 69451 triangles
Comments: contains 5 holes in the bottom
Stanford Dragon:
Source: Stanford University Computer Graphics Laboratory
Link: https://graphics.stanford.edu/data/3Dscanrep/
Scanner: Cyberware 3030 MS + spacetime analysis
Number of scans: ~70
Total size of scans: 2,748,318 points (about 5,500,000 triangles)
Reconstruction: vrip (conservatively decimated)
Size of reconstruction: 566,098 vertices, 1,132,830 triangles
Comments: contains numerous small holes