CIS565-Fall-2022 · conswang · Jan 5, 2023 · Jan 12, 2023 · Jan 12, 2023 · Jan 12, 2023
diff --git a/README.md b/README.md
@@ -3,11 +3,148 @@ CUDA Denoiser For CUDA Path Tracer
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+Constance Wang
+  * [LinkedIn](https://www.linkedin.com/in/conswang/)
 
-### (TODO: Your README)
+Tested on AORUS 15P XD laptop with specs:  
+- Windows 11 22000.856  
+- 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz 2.30 GHz  
+- NVIDIA GeForce RTX 3070 Laptop GPU  
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+This is an implementation of the [Edge-Avoiding À-Trous Wavelet Transform for fast Global
+Illumination Filtering](https://jo.dreggn.org/home/2010_atrous.pdf) in CUDA, integrated into a CUDA pathtracer.
 
+### Features
+- Gbuffer to store normals and positions of each pixel
+- Denoising pass that blurs pixels using an A-trous kernel, but avoids edges based on neighbouring pixels' ray-traced colour, normal, and position
+- Parameters: filterSize = 5 * 2^(# of iterations of A-trous), and colorWeight, normalWeight, positionWeight which correspond to the sigma parameter in the weight calculations from the paper for colors, normals, and positions respectively
+- Path-tracer integration: the bonus features and performance testing for this assignment were done in the base code of this project. However, I also integrated the denoiser into my project 3 pathtracer to visually test more complex scenes; see the [proj-4-denoiser](https://github.com/conswang/Project3-CUDA-Path-Tracer/pull/1) branch
+- Extra credit
+  - Gaussian filter
+
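A minimal sketch of how the three weights listed above could combine into one blend weight, assuming an edge-stopping function of the general form described in the paper (an exponential falloff in the squared difference, scaled by sigma squared and clamped to 1). `Vec3`, `edgeWeight`, and `combinedWeight` are hypothetical names used for illustration only, and the exact normalisation is an assumption rather than this project's kernel code.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical stand-in for glm::vec3, so the sketch has no dependencies.
struct Vec3 { float x, y, z; };

// Squared Euclidean distance between two 3-vectors.
inline float squaredDistance(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Edge-stopping weight: 1 when p and q match, falling toward 0 as they diverge.
// sigma must be positive; a larger sigma tolerates larger differences.
inline float edgeWeight(const Vec3& p, const Vec3& q, float sigma) {
    return std::min(std::exp(-squaredDistance(p, q) / (sigma * sigma)), 1.0f);
}

// Overall weight for a neighbour q of pixel p: the product of the colour,
// normal, and position weights (assumed combination).
inline float combinedWeight(const Vec3& colorP, const Vec3& colorQ,
                            const Vec3& normalP, const Vec3& normalQ,
                            const Vec3& posP, const Vec3& posQ,
                            float colorSigma, float normalSigma, float positionSigma) {
    return edgeWeight(colorP, colorQ, colorSigma) *
           edgeWeight(normalP, normalQ, normalSigma) *
           edgeWeight(posP, posQ, positionSigma);
}
```

Under this form, a neighbour that differs strongly in any one of the three buffers contributes almost nothing, which is what keeps the blur from crossing edges.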
+Showing the gbuffers as colours for cornell ceiling light scene (click "Show G-buffers" with `SHOW_GBUFFER_NORMALS` or `SHOW_GBUFFER_POS` macros set to 1).
+| Normals | Positions |
+| ---- | ----|
+| ![](img/box/normal-gbuffer.png) | ![](img/box/pos-gbuffer.png) |
+
+Denoiser on cornell ceiling light scene (10 samples) with filterSize = 80, colorWeight = 1.804, normalWeight = 0.309, positionWeight = 7.113. The a-trous only image shows the blur effect of the A-trous kernel only, without edge detection.
+
+|Original | A-trous only | Denoised with edge detection |
+| --- | ---| ---|
+| ![](img/box/orig.png) | ![](img/box/a-trous-only.png) | ![](img/box/denoised.png) |
+
+Denoiser tested on complex scene: [motorcycle.gltf](https://github.com/conswang/Project3-CUDA-Path-Tracer/blob/main/scenes/motorcycle/motorcycle.gltf) with filterSize = 320, colorWeight = 4, normalWeight = 1, positionWeight = 1.
+
+| Samples | Original | Denoised |
+|-----| ----- | ---- |
+| 10 | ![](img/motorcycle/10-samples-noisy.png) | ![](img/motorcycle/10-samples-denoised.png) |
+| 20 | ![](img/motorcycle/20-samples-noisy.png) | ![](img/motorcycle/20-samples-denoised.png) |
+| 50 | ![](img/motorcycle/50-samples-noisy.png) | ![](img/motorcycle/50-samples-denoised.png) |
+| 100 | ![](img/motorcycle/100-samples-noisy.png) | ![](img/motorcycle/100-samples-denoised.png) |
+
+### Gaussian Filter
+I also implemented an edge avoiding Gaussian filter using a hard-coded 11 x 11 kernel instead of A-trous. Set the flag `GAUSSIAN_KERNEL` to 1 to use the Gaussian kernel. The results are visually very similar, even without the weighted edge-avoidance.
+
+| Pure Gaussian (11 x 11 = 121) | Pure A-trous (filterSize 80) |
+| --- | --- |
+| ![](img/gaussian/gaussianonly.png) | ![](img/box/a-trous-only.png) |
+
+The kernels are visually almost indistinguishable, and the edge avoidance code works the same way, so the results are visually similar as well (using default params):
+
+| Gaussian | A-trous |
+|--|--|
+| ![](img/gaussian/gaussian-edge-avoid.png) | ![](img/box/denoised.png)|
+
+However, A-trous is much faster:
+
+![](img/graphs/Performance%20Gain%20of%20A-Trous%20over%20Gaussian%20Kernel%20(Lower%20is%20Better).png)
+(tested on cornell with ceiling, 10 iter, default params)
+
+### Visual Analysis
+
+For the motorcycle scene, it takes about 100 iterations to get a smooth result. Note that it's very hard to save the details on surfaces like the vending machine. This is because the normals of neighbouring pixels are very similar (the surface is almost flat), positions are similar (object is centered and directly faces the camera), and the colours are similar, so the overall blend weight is high. To preserve the edges on the coke bottle, we'd need sigma values so small that the denoising effect in other areas would be greatly reduced. In other words, a big drawback of edge-avoiding A-trous is that different objects would render better with different parameters, but we use a uniform filter size and weights across the image.
+
+For simpler scenes, it takes far fewer iterations, since we can just ramp up the weight values without losing too much detail. With fewer than 100 iterations, there tend to be some splotchy visual artifacts, since there is still too much noise in the render to blur.
+
+Avocado with filterSize = 80, colorWeight = 2, normalWeight = 0.12, positionWeight = 0.5:
+
+| Original (5000 samples) | Denoised (100 samples) |
+| --- | ---|
+| ![](img/avocado/5000-samples-orig.png) | ![](img/avocado/100-samples-denoised.png) |
+
+The denoiser looks less splotchy on images that have less noise in the first place... The cornell box with ceiling light has a larger light with lower intensity compared to the cornell box with a smaller light of higher intensity. Although the overall amount of light is similar, the smaller light causes more noise since there is a smaller chance of sampling a ray that hits it.
+
+10 samples for each:
+| Cornell (filterSize=80, colorWeight=10, normalWeight=0.221, positionWeight=1.768) | Cornell with ceiling light (filterSize = 80, colorWeight = 1.804, normalWeight = 0.309, positionWeight = 7.113) |
+|-- |--|
+|![](img/cornell/denoised.png)| ![](img/box/denoised.png) |
+
+#### Different Materials
+
+The denoiser works well enough for diffuse materials, since they should look very smooth in the first place. Same with perfectly specular materials, since they end up just being reflections of different diffuse surfaces.
+
+However, texture maps tend to be blurred too much even at very low colour/normal/position weights, as evidenced by the vending machine from the motorcycle render or this metal rectangle render:
+| 20 samples denoised | Original (2000 samples) |
+| ---|--|
+| ![](img/railing.png) | ![](img/metal-with-normal-texture.png) |
+
+#### Varying Filter Size
+
+Increasing the filter size makes the image smoother; however, filter sizes greater than 80 have less and less of an effect (the most dramatic transitions are from sizes 10 to 40). This is because the sampled pixels end up so far apart that position weighting greatly reduces their colour contribution; their normals and colours may also be very different.
+
+Avocado scene with colorWeight = 2, normalWeight = 0.12, positionWeight = 0.5.
+| Filter size | Image |
+|--|--|
+| 10 | ![](img/avocado/20-samples-filtersize-10.png)|
+| 20 | ![](img/avocado/20-samples-filter-size-20.png) |
+| 40 | ![](img/avocado/20-samples-filtersize-40.png) |
+| 80 | ![](img/avocado/20-samples-filtersize-80.png) |
+| 160 | ![](img/avocado/20-samples-filtersize-160.png) |
+| 320 | ![](img/avocado/20-samples-filtersize-320.png) |
+
+### Performance Analysis
+
+When the `MEASURE_DENOISE_PERF` flag is set to 0, each iteration is denoised for more convenient debugging.  When set to 1, only the last path-traced iteration is denoised for a more accurate performance analysis. In the project 3 version of my denoiser (used for Avocado and motorcycle scenes), performance is always measured in the second way.
+
+I measured the total render time (from `pathtraceInit`, up to but not including `pathtraceFree`), the g-buffer initialization time, and the denoising time. The path-tracing time is calculated by subtracting g-buffer and denoise from the total render time.
+
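The subtraction described above can be sketched as follows. This is an illustrative helper, not the project's timing code: `RenderTimings`, `pathTraceSeconds`, and `timeSeconds` are hypothetical names. It uses `std::chrono::steady_clock`, which is monotonic and so is generally the safer choice for measuring intervals.

```cpp
#include <chrono>
#include <utility>

// Hypothetical container for the three measured durations, in seconds.
struct RenderTimings {
    double totalSeconds;    // pathtraceInit up to (not including) pathtraceFree
    double gbufferSeconds;  // g-buffer initialization
    double denoiseSeconds;  // denoising pass
};

// Path-tracing time is whatever remains after removing g-buffer and denoise time.
inline double pathTraceSeconds(const RenderTimings& t) {
    return t.totalSeconds - t.gbufferSeconds - t.denoiseSeconds;
}

// Runs `work` once and returns the elapsed wall-clock time in seconds.
template <typename F>
double timeSeconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

Note that CUDA kernel launches return asynchronously, so a host-side timer like this only measures GPU work if the timed callable blocks until the kernels have finished (for example with `cudaDeviceSynchronize()`).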
+#### Denoising Runtime
+
+Denoising should have a very small effect on render time, since it runs in constant time in parallel on the GPU. We only need to generate the g buffers on the first bounce of the first iteration. Then, we only need to denoise once after raytracing is complete. Both steps launch kernels that run in constant time for each pixel in parallel. 
+
+The measurements show that denoising and g-buffer generation are both very fast compared to path-tracing for 10 iterations. The difference would be even more pronounced as we increase the number of iterations.
+
+![](img/graphs/Effect%20of%20Denoising%20Step%20on%20Total%20Path-tracing%20Time.png)
+
+#### Varying Image Resolution
+
+We can also look at how the denoising time is affected by the image resolution. These results were tested on the cornell with ceiling light scene, with `filterSize = 80, colorWeight = 0.4, normalWeight = 0.35, positionWeight = 0.2`, and a block size of 16 x 16. G-buffer construction time is still negligible. I also implemented a version of the performance test where I grid-searched for the best block size (from 4 x 4 to 32 x 32), but found that the trend was almost exactly the same (see [graph](img/graphs/Effect%20of%20Increasing%20Image%20Resolution%20on%20Denoising%20Time%20with%20Variable%20Block%20Size.png)).
+
+| Resolution (pixels) | Denoising Time (seconds) | Percent of Time to Render 10 Iterations |
+| --|--|--|
+| 200 x 200| 0.0001851 | 0.13% |
+| 400 x 400| 0.0003627 | 0.24% |
+| 800 x 800| 0.001175 | 0.68% |
+| 1600 x 1600| 0.0043464 | 1.34% |
+| 3200 x 3200 |0.0159033 | 2.03% |
+
+Plotting the results shows that the denoising time increases almost perfectly linearly with respect to the number of pixels. In comparison, the other path-tracing steps do not scale linearly as resolution increases, so the total proportion of time spent denoising increases, making denoising slightly less efficient at higher resolutions.
+
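As a quick check on the scaling, the table above can be turned into row-to-row ratios. Each row has 4x the pixels of the previous one, so perfectly linear scaling would give a ratio of 4 in denoising time. The sketch below uses the measured values from the table; the helper names are hypothetical.

```cpp
#include <array>
#include <cstddef>

// One row of the resolution table: square image side length and denoise time.
struct Sample {
    int side;
    double seconds;
};

// Measurements copied from the table above.
const std::array<Sample, 5> kSamples = {{
    {200, 0.0001851},
    {400, 0.0003627},
    {800, 0.001175},
    {1600, 0.0043464},
    {3200, 0.0159033},
}};

// Ratio of row i's denoise time to row i-1's (valid for i >= 1).
inline double timeRatio(std::size_t i) {
    return kSamples[i].seconds / kSamples[i - 1].seconds;
}

// Denoise time per pixel for row i, in nanoseconds.
inline double nanosecondsPerPixel(std::size_t i) {
    const double pixels = static_cast<double>(kSamples[i].side) * kSamples[i].side;
    return kSamples[i].seconds * 1e9 / pixels;
}
```

The ratios come out to roughly 2.0, 3.2, 3.7, and 3.7. That is consistent with near-linear growth at the larger resolutions, plus a fixed overhead (such as kernel launch cost) that plausibly dominates at the smallest ones.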
+![](img/graphs/Effect%20of%20Increasing%20Image%20Resolution%20on%20Denoising%20Time%20(linear%20scale%2C%20block%20size%20%3D%2016%20x%2016).png)
+
+Through the very rigorous method of commenting out parts of the code and checking the run time, I found two sections that made the code extra slow:
+1. global memory access when getting gbuffer data at neighbouring pixels' indices
+2. calculating the edge avoidance weight (specifically, the exp function)
+
+#1 probably scales badly due to the increase in number of pixels that need to access neighbouring pixels' data from different blocks, so caching isn't as helpful. Without these two steps, the 3200 x 3200 resolution test would run about 10x faster.
+
+#### Varying Filter Size  
+
+Tested on cornell ceiling light scene with default color/normal/position weights, filter sizes = 10, 20, 40, 80, ... 640. 
+
+![](img/graphs/Effect%20of%20Filter%20Size%20on%20Denoising%20Time.png)
+
+Denoising time increases linearly with respect to log filter size. This makes sense, since filter size = 2 ^ (# of iterations) x 5, and denoising time should increase linearly as the number of A-trous iterations does.
+
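A small sketch of the relationship above, assuming filterSize = 5 * 2^(iterations) as stated in this README and a step width that doubles on each A-trous pass. `atrousIterations` and `stepWidth` are hypothetical helper names, not functions from this project.

```cpp
// Number of A-trous iterations implied by filterSize = 5 * 2^iterations.
// Filter sizes below 10 yield 0; sizes between powers round down.
inline int atrousIterations(int filterSize) {
    int iterations = 0;
    for (int size = 5; size * 2 <= filterSize; size *= 2) {
        ++iterations;
    }
    return iterations;
}

// Spacing between kernel taps on a given pass (0-based): 1, 2, 4, 8, ...
// The 5 x 5 kernel keeps 25 taps per pass, so per-pixel cost per pass is constant.
inline int stepWidth(int pass) {
    return 1 << pass;
}
```

For example, `atrousIterations(80)` gives 4 and `atrousIterations(640)` gives 7, so doubling the filter size adds one pass of constant cost. That matches the linear-in-log trend in the graph.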
+### Bloopers
+[are here](https://docs.google.com/document/d/1BJmclri4VJY_IXbsLU8Er_CQihQnfmzTQRi5cz9FthM/edit#heading=h.9whglgx4yoxx)
diff --git a/img/avocado/100-samples-denoised.png b/img/avocado/100-samples-denoised.png
diff --git a/img/avocado/20-samples-filter-size-20.png b/img/avocado/20-samples-filter-size-20.png
diff --git a/img/avocado/20-samples-filtersize-10.png b/img/avocado/20-samples-filtersize-10.png
diff --git a/img/avocado/20-samples-filtersize-160.png b/img/avocado/20-samples-filtersize-160.png
diff --git a/img/avocado/20-samples-filtersize-320.png b/img/avocado/20-samples-filtersize-320.png
diff --git a/img/avocado/20-samples-filtersize-40.png b/img/avocado/20-samples-filtersize-40.png
diff --git a/img/avocado/20-samples-filtersize-80.png b/img/avocado/20-samples-filtersize-80.png
diff --git a/img/avocado/5000-samples-orig.png b/img/avocado/5000-samples-orig.png
diff --git a/img/box/a-trous-only.png b/img/box/a-trous-only.png
diff --git a/img/box/denoised.png b/img/box/denoised.png
diff --git a/img/box/noisy-reflection.png b/img/box/noisy-reflection.png
diff --git a/img/box/normal-gbuffer.png b/img/box/normal-gbuffer.png
diff --git a/img/box/orig.png b/img/box/orig.png
diff --git a/img/box/pos-gbuffer.png b/img/box/pos-gbuffer.png
diff --git a/img/cornell/denoised.png b/img/cornell/denoised.png
diff --git a/img/cornell/diffuse.png b/img/cornell/diffuse.png
diff --git a/img/gaussian/gaussian-edge-avoid.png b/img/gaussian/gaussian-edge-avoid.png
diff --git a/img/gaussian/gaussianonly.png b/img/gaussian/gaussianonly.png
diff --git a/img/graphs/Effect of Denoising Step on Total Path-tracing Time.png b/img/graphs/Effect of Denoising Step on Total Path-tracing Time.png
diff --git a/img/graphs/Effect of Filter Size on Denoising Time.png b/img/graphs/Effect of Filter Size on Denoising Time.png
diff --git a/...ffect of Increasing Image Resolution on Denoising Time (block size = 8 x 8).png b/...ffect of Increasing Image Resolution on Denoising Time (block size = 8 x 8).png
diff --git a/...ing Image Resolution on Denoising Time (linear scale, block size = 16 x 16).png b/...ing Image Resolution on Denoising Time (linear scale, block size = 16 x 16).png
diff --git a/...t of Increasing Image Resolution on Denoising Time with Variable Block Size.png b/...t of Increasing Image Resolution on Denoising Time with Variable Block Size.png
diff --git a/img/graphs/Performance Gain of A-Trous over Gaussian Kernel (Lower is Better).png b/img/graphs/Performance Gain of A-Trous over Gaussian Kernel (Lower is Better).png
diff --git a/img/metal-with-normal-texture.png b/img/metal-with-normal-texture.png
diff --git a/img/motorcycle/10-samples-denoised.png b/img/motorcycle/10-samples-denoised.png
diff --git a/img/motorcycle/10-samples-noisy.png b/img/motorcycle/10-samples-noisy.png
diff --git a/img/motorcycle/100-samples-denoised.png b/img/motorcycle/100-samples-denoised.png
diff --git a/img/motorcycle/100-samples-noisy.png b/img/motorcycle/100-samples-noisy.png
diff --git a/img/motorcycle/20-samples-denoised.png b/img/motorcycle/20-samples-denoised.png
diff --git a/img/motorcycle/20-samples-noisy.png b/img/motorcycle/20-samples-noisy.png
diff --git a/img/motorcycle/50-samples-denoised.png b/img/motorcycle/50-samples-denoised.png
diff --git a/img/motorcycle/50-samples-noisy.png b/img/motorcycle/50-samples-noisy.png
diff --git a/img/motorcycle/5000-samples-ref.png b/img/motorcycle/5000-samples-ref.png
diff --git a/img/railing.png b/img/railing.png
diff --git a/scenes/cornell.txt b/scenes/cornell.txt
@@ -52,7 +52,7 @@ EMITTANCE   0
 CAMERA
 RES         800 800
 FOVY        45
-ITERATIONS  5000
+ITERATIONS  10
 DEPTH       8
 FILE        cornell
 EYE         0.0 5 10.5

diff --git a/src/main.cpp b/src/main.cpp
@@ -1,6 +1,7 @@
 #include "main.h"
 #include "preview.h"
 #include <cstring>
+#include <chrono>
 
 #include "../imgui/imgui.h"
 #include "../imgui/imgui_impl_glfw.h"
@@ -45,6 +46,8 @@ int iteration;
 int width;
 int height;
 
+std::chrono::system_clock::time_point pathtraceStart;
+
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
@@ -99,6 +102,7 @@ int main(int argc, char** argv) {
 
 void saveImage() {
     float samples = iteration;
+
     // output image file
     image img(width, height);
 
@@ -151,6 +155,9 @@ void runCuda() {
 
     if (iteration == 0) {
         pathtraceFree();
+
+        pathtraceStart = std::chrono::system_clock::now(); // start timing pathtracer from first iter
+
         pathtraceInit(scene);
     }
 
@@ -171,6 +178,44 @@ void runCuda() {
       showImage(pbo_dptr, iteration);
     }
 
+    // only denoise at last iteration
+#if MEASURE_DENOISE_PERF
+    if (iteration == ui_iterations) {
+
+      auto pathtraceEnd = std::chrono::system_clock::now(); // includes g-buffer runtime
+      std::chrono::duration<double> pathtraceTime = pathtraceEnd - pathtraceStart;
+      std::cout << "Total path-trace run-time (seconds): " << pathtraceTime.count() << std::endl;
+
+      auto start = std::chrono::system_clock::now();
+
+#if GAUSSIAN_KERNEL
+      denoiseGaussianAndWriteToPbo(pbo_dptr, iteration, ui_colorWeight, ui_normalWeight, ui_positionWeight);
+#else
+      denoiseAndWriteToPbo(pbo_dptr, iteration, ui_filterSize, ui_colorWeight, ui_normalWeight, ui_positionWeight, glm::ivec2(16, 16));
+#endif
+
+      auto end = std::chrono::system_clock::now();
+      std::chrono::duration<double> elapsed_seconds = end - start;
+      std::cout << "Denoise run-time (seconds): " << elapsed_seconds.count() << std::endl;
+
+      std::cout << "Fraction of time spent on denoising: " << elapsed_seconds.count() / (elapsed_seconds.count() + pathtraceTime.count()) << std::endl;
+
+      std::cout << std::endl;
+
+      pathtraceFree();
+      cudaDeviceReset();
+      exit(EXIT_SUCCESS);
+    }
+#else
+    if (ui_denoise) {
+#if GAUSSIAN_KERNEL
+      denoiseGaussianAndWriteToPbo(pbo_dptr, iteration, ui_colorWeight, ui_normalWeight, ui_positionWeight);
+#else
+      denoiseAndWriteToPbo(pbo_dptr, iteration, ui_filterSize, ui_colorWeight, ui_normalWeight, ui_positionWeight, glm::ivec2(8, 8));
+#endif
+    }
+#endif
+
     // unmap buffer object
     cudaGLUnmapBufferObject(pbo);