diff --git a/README.md b/README.md
index f044c821..89da089e 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,148 @@ CUDA Denoiser For CUDA Path Tracer
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+Constance Wang
+  * [LinkedIn](https://www.linkedin.com/in/conswang/)
 
-### (TODO: Your README)
+Tested on AORUS 15P XD laptop with specs:  
+- Windows 11 22000.856  
+- 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz 2.30 GHz  
+- NVIDIA GeForce RTX 3070 Laptop GPU  
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+This is an implementation of the [Edge-Avoiding À-Trous Wavelet Transform for fast Global
+Illumination Filtering](https://jo.dreggn.org/home/2010_atrous.pdf) in CUDA, integrated into a CUDA pathtracer.
 
+### Features
+- G-buffer that stores the normal and position at each pixel
+- Denoising pass that blurs pixels using an A-trous kernel, but avoids edges based on neighbouring pixels' ray-traced colour, normal, and position
+- Parameters: `filterSize` = 5 * 2^(# of A-trous iterations), plus `colorWeight`, `normalWeight`, and `positionWeight`, which correspond to the sigma parameters in the paper's weight calculations for colours, normals, and positions respectively
+- Path-tracer integration: the bonus features and performance testing for this assignment were done in the base code of this project. However, I also integrated the denoiser into my project 3 pathtracer to visually test more complex scenes; see the [proj-4-denoiser](https://github.com/conswang/Project3-CUDA-Path-Tracer/pull/1) branch
+- Extra credit
+  - Gaussian filter
+
+The G-buffers shown as colours for the Cornell ceiling light scene (click "Show G-buffers" with the `SHOW_GBUFFER_NORMALS` or `SHOW_GBUFFER_POS` macro set to 1):
+| Normals | Positions |
+| ---- | ----|
+| ![](img/box/normal-gbuffer.png) | ![](img/box/pos-gbuffer.png) |
+
+Denoiser on the Cornell ceiling light scene (10 samples) with filterSize = 80, colorWeight = 1.804, normalWeight = 0.309, positionWeight = 7.113. The A-trous-only image shows the blur from the A-trous kernel alone, without edge detection.
+
+|Original | A-trous only | Denoised with edge detection |
+| --- | ---| ---|
+| ![](img/box/orig.png) | ![](img/box/a-trous-only.png) | ![](img/box/denoised.png) |
+
+Denoiser tested on a complex scene: [motorcycle.gltf](https://github.com/conswang/Project3-CUDA-Path-Tracer/blob/main/scenes/motorcycle/motorcycle.gltf) with filterSize = 320, colorWeight = 4, normalWeight = 1, positionWeight = 1.
+
+| Samples | Original | Denoised |
+|-----| ----- | ---- |
+| 10 | ![](img/motorcycle/10-samples-noisy.png) | ![](img/motorcycle/10-samples-denoised.png) |
+| 20 | ![](img/motorcycle/20-samples-noisy.png) | ![](img/motorcycle/20-samples-denoised.png) |
+| 50 | ![](img/motorcycle/50-samples-noisy.png) | ![](img/motorcycle/50-samples-denoised.png) |
+| 100 | ![](img/motorcycle/100-samples-noisy.png) | ![](img/motorcycle/100-samples-denoised.png) |
+
+### Gaussian Filter
+I also implemented an edge-avoiding Gaussian filter using a hard-coded 11 x 11 kernel instead of A-trous. Set the flag `GAUSSIAN_KERNEL` to 1 to use the Gaussian kernel. Even without the weighted edge avoidance, the two kernels produce visually very similar blurs.
+
+| Pure Gaussian (11 x 11 = 121) | Pure A-trous (filterSize 80) |
+| --- | --- |
+| ![](img/gaussian/gaussianonly.png) | ![](img/box/a-trous-only.png) |
+
+The kernels are visually almost indistinguishable, and the edge avoidance code works the same way, so the results are visually similar as well (using default params):
+
+| Gaussian | A-trous |
+|--|--|
+| ![](img/gaussian/gaussian-edge-avoid.png) | ![](img/box/denoised.png)|
+
+However, A-trous is much faster:
+
+![](img/graphs/Performance%20Gain%20of%20A-Trous%20over%20Gaussian%20Kernel%20(Lower%20is%20Better).png)
+(Tested on the Cornell ceiling light scene, 10 iterations, default parameters.)
+
+### Visual Analysis
+
+For the motorcycle scene, it takes about 100 samples (path-tracing iterations) to get a smooth result. Note that it's very hard to preserve the details on surfaces like the vending machine. This is because the normals of neighbouring pixels are very similar (the surface is almost flat), the positions are similar (the object is centered and directly faces the camera), and the colours are similar, so the overall blend weight is high. To preserve the edges on the coke bottle, we'd need sigma values so small that the denoising effect in other areas would be greatly reduced. In other words, a big drawback of edge-avoiding A-trous is that different objects would render better with different parameters, but we use a uniform filter size and uniform weights across the image.
+
+For simpler scenes, far fewer iterations are needed, since we can just ramp up the weight values without losing too much detail. With fewer than 100 iterations, there tend to be some splotchy visual artifacts, since there is still too much noise in the render to blur away.
+
+Avocado with filterSize = 80, colorWeight = 2, normalWeight = 0.12, positionWeight = 0.5:
+
+| Original (5000 samples) | Denoised (100 samples) |
+| --- | ---|
+| ![](img/avocado/5000-samples-orig.png) | ![](img/avocado/100-samples-denoised.png) |
+
+The denoiser looks less splotchy on images that have less noise in the first place. The Cornell box with a ceiling light has a larger light with lower intensity, while the plain Cornell box has a smaller light with higher intensity. Although the overall amount of light is similar, the smaller light causes more noise, since there is a smaller chance of sampling a ray that hits it.
+
+10 samples for each:
+| Cornell (filterSize=80, colorWeight=10, normalWeight=0.221, positionWeight=1.768) | Cornell with ceiling light (filterSize = 80, colorWeight = 1.804, normalWeight = 0.309, positionWeight = 7.113) |
+|-- |--|
+|![](img/cornell/denoised.png)| ![](img/box/denoised.png) |
+
+#### Different Materials
+
+The denoiser works well enough for diffuse materials, since they should look very smooth in the first place. Same with perfectly specular materials, since they end up just being reflections of different diffuse surfaces.
+
+However, texture maps tend to be blurred too much even at very low colour/normal/position weights, as evidenced by the vending machine from the motorcycle render or this metal rectangle render:
+| 20 samples denoised | Original (2000 samples) |
+| ---|--|
+| ![](img/railing.png) | ![](img/metal-with-normal-texture.png) |
+
+#### Varying Filter Size
+
+Increasing the filter size makes the image smoother; however, filter sizes greater than 80 have less and less of an effect (the most dramatic transitions are between sizes 10 and 40). This is because the sampled pixels end up so far apart that position weighting greatly reduces their colour contribution, and their normals and colours may also be very different.
+
+Avocado scene with colorWeight = 2, normalWeight = 0.12, positionWeight = 0.5.
+| Filter size | Image |
+|--|--|
+| 10 | ![](img/avocado/20-samples-filtersize-10.png)|
+| 20 | ![](img/avocado/20-samples-filter-size-20.png) |
+| 40 | ![](img/avocado/20-samples-filtersize-40.png) |
+| 80 | ![](img/avocado/20-samples-filtersize-80.png) |
+| 160 | ![](img/avocado/20-samples-filtersize-160.png) |
+| 320 | ![](img/avocado/20-samples-filtersize-320.png) |
+
+### Performance Analysis
+
+When the `MEASURE_DENOISE_PERF` flag is set to 0, every path-traced iteration is denoised, which is more convenient for debugging. When it is set to 1, only the last path-traced iteration is denoised, which gives a more accurate performance analysis. In the project 3 version of my denoiser (used for the Avocado and motorcycle scenes), performance is always measured in the second way.
+
+I measured the total render time (from `pathtraceInit`, up to but not including `pathtraceFree`), the g-buffer initialization time, and the denoising time. The path-tracing time is calculated by subtracting g-buffer and denoise from the total render time.
+
+#### Denoising Runtime
+
+Denoising should have a very small effect on render time. We only need to generate the G-buffers on the first bounce of the first iteration, and we only need to denoise once, after path tracing is complete. Both steps launch kernels that do a constant amount of work per pixel, with the pixels processed in parallel on the GPU.
+
+The measurements show that denoising and G-buffer generation are both very fast compared to path-tracing for 10 iterations. The results would be even more skewed as we increase the number of iterations.
+
+![](img/graphs/Effect%20of%20Denoising%20Step%20on%20Total%20Path-tracing%20Time.png)
+
+#### Varying Image Resolution
+
+We can also look at how the denoising time is affected by the image resolution. These results were tested on the cornell with ceiling light scene, with `filterSize = 80, colorWeight = 0.4, normalWeight = 0.35, positionWeight = 0.2`, and a block size of 16 x 16. G-buffer construction time is still negligible. I also implemented a version of the performance test where I grid-searched for the best block size (from 4 x 4 to 32 x 32), but found that the trend was almost exactly the same (see [graph](img/graphs/Effect%20of%20Increasing%20Image%20Resolution%20on%20Denoising%20Time%20with%20Variable%20Block%20Size.png)).
+
+| Resolution (pixels) | Denoising Time (seconds) | Percent of Time to Render 10 Iterations |
+| --|--|--|
+| 200 x 200| 0.0001851 | 0.13% |
+| 400 x 400| 0.0003627 | 0.24% |
+| 800 x 800| 0.001175 | 0.68% |
+| 1600 x 1600| 0.0043464 | 1.34% |
+| 3200 x 3200 |0.0159033 | 2.03% |
+
+Plotting the results shows that the denoising time increases almost perfectly linearly with respect to the number of pixels. In comparison, the other path-tracing steps grow more slowly than the pixel count over this range, so the proportion of total time spent denoising increases, making denoising relatively more expensive at higher resolutions.
+
+![](img/graphs/Effect%20of%20Increasing%20Image%20Resolution%20on%20Denoising%20Time%20(linear%20scale%2C%20block%20size%20%3D%2016%20x%2016).png)
+
+Through the very rigorous method of commenting out parts of the code and checking the run time, I found two sections that made the code extra slow:
+1. global memory access when getting gbuffer data at neighbouring pixels' indices
+2. calculating the edge avoidance weight (specifically, the exp function)
+
+The first item probably scales badly because, at higher resolutions, more pixels need to access neighbouring pixels' data from different blocks, so caching isn't as helpful. Without these two steps, the 3200 x 3200 resolution test would run about 10x faster.
+
+#### Varying Filter Size  
+
+Tested on cornell ceiling light scene with default color/normal/position weights, filter sizes = 10, 20, 40, 80, ... 640. 
+
+![](img/graphs/Effect%20of%20Filter%20Size%20on%20Denoising%20Time.png)
+
+Denoising time increases linearly with respect to log filter size. This makes sense, since filter size = 2 ^ (# of iterations) x 5, and denoising time should increase linearly as the number of A-trous iterations does.
+
+### Bloopers
+[are here](https://docs.google.com/document/d/1BJmclri4VJY_IXbsLU8Er_CQihQnfmzTQRi5cz9FthM/edit#heading=h.9whglgx4yoxx)
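The filter-size relation used in the README above (filter size = 5 x 2^(number of A-trous iterations)) can be checked with a small host-side sketch. The helper name `atrousStepWidths` is hypothetical and not part of the project; the sketch assumes the step width starts at 1 and doubles on every pass, and that sizes below 10 result in no passes, which matches an integer `log2(filterSize / 5)`.

```cpp
#include <vector>

// Hypothetical helper (not from the project): list the step width used by
// each A-trous pass for a given filter size, assuming
// filterSize = 5 * 2^(number of passes) and a step width that doubles per pass.
std::vector<int> atrousStepWidths(int filterSize) {
    std::vector<int> steps;
    int stepWidth = 1;
    // The first pass corresponds to a window of 10; each later pass doubles it.
    // Stop once the next window would exceed the requested filter size.
    for (int window = 10; window <= filterSize; window *= 2) {
        steps.push_back(stepWidth);
        stepWidth *= 2;
    }
    return steps;
}
```

For example, `atrousStepWidths(80)` gives four passes with step widths 1, 2, 4, 8, which agrees with log2(80 / 5) = 4.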
diff --git a/img/avocado/100-samples-denoised.png b/img/avocado/100-samples-denoised.png
new file mode 100644
index 00000000..90e74498
Binary files /dev/null and b/img/avocado/100-samples-denoised.png differ
diff --git a/img/avocado/20-samples-filter-size-20.png b/img/avocado/20-samples-filter-size-20.png
new file mode 100644
index 00000000..c1f50a82
Binary files /dev/null and b/img/avocado/20-samples-filter-size-20.png differ
diff --git a/img/avocado/20-samples-filtersize-10.png b/img/avocado/20-samples-filtersize-10.png
new file mode 100644
index 00000000..8aa27d62
Binary files /dev/null and b/img/avocado/20-samples-filtersize-10.png differ
diff --git a/img/avocado/20-samples-filtersize-160.png b/img/avocado/20-samples-filtersize-160.png
new file mode 100644
index 00000000..18e8db5a
Binary files /dev/null and b/img/avocado/20-samples-filtersize-160.png differ
diff --git a/img/avocado/20-samples-filtersize-320.png b/img/avocado/20-samples-filtersize-320.png
new file mode 100644
index 00000000..87a5ca62
Binary files /dev/null and b/img/avocado/20-samples-filtersize-320.png differ
diff --git a/img/avocado/20-samples-filtersize-40.png b/img/avocado/20-samples-filtersize-40.png
new file mode 100644
index 00000000..5dd07f3d
Binary files /dev/null and b/img/avocado/20-samples-filtersize-40.png differ
diff --git a/img/avocado/20-samples-filtersize-80.png b/img/avocado/20-samples-filtersize-80.png
new file mode 100644
index 00000000..5dd07f3d
Binary files /dev/null and b/img/avocado/20-samples-filtersize-80.png differ
diff --git a/img/avocado/5000-samples-orig.png b/img/avocado/5000-samples-orig.png
new file mode 100644
index 00000000..35e248b5
Binary files /dev/null and b/img/avocado/5000-samples-orig.png differ
diff --git a/img/box/a-trous-only.png b/img/box/a-trous-only.png
new file mode 100644
index 00000000..acac5b72
Binary files /dev/null and b/img/box/a-trous-only.png differ
diff --git a/img/box/denoised.png b/img/box/denoised.png
new file mode 100644
index 00000000..70cfbbaa
Binary files /dev/null and b/img/box/denoised.png differ
diff --git a/img/box/noisy-reflection.png b/img/box/noisy-reflection.png
new file mode 100644
index 00000000..ea188ba4
Binary files /dev/null and b/img/box/noisy-reflection.png differ
diff --git a/img/box/normal-gbuffer.png b/img/box/normal-gbuffer.png
new file mode 100644
index 00000000..22931ead
Binary files /dev/null and b/img/box/normal-gbuffer.png differ
diff --git a/img/box/orig.png b/img/box/orig.png
new file mode 100644
index 00000000..163c3199
Binary files /dev/null and b/img/box/orig.png differ
diff --git a/img/box/pos-gbuffer.png b/img/box/pos-gbuffer.png
new file mode 100644
index 00000000..618775c3
Binary files /dev/null and b/img/box/pos-gbuffer.png differ
diff --git a/img/cornell/denoised.png b/img/cornell/denoised.png
new file mode 100644
index 00000000..13403491
Binary files /dev/null and b/img/cornell/denoised.png differ
diff --git a/img/cornell/diffuse.png b/img/cornell/diffuse.png
new file mode 100644
index 00000000..7b602fd0
Binary files /dev/null and b/img/cornell/diffuse.png differ
diff --git a/img/gaussian/gaussian-edge-avoid.png b/img/gaussian/gaussian-edge-avoid.png
new file mode 100644
index 00000000..4516a15f
Binary files /dev/null and b/img/gaussian/gaussian-edge-avoid.png differ
diff --git a/img/gaussian/gaussianonly.png b/img/gaussian/gaussianonly.png
new file mode 100644
index 00000000..50cb8b43
Binary files /dev/null and b/img/gaussian/gaussianonly.png differ
diff --git a/img/graphs/Effect of Denoising Step on Total Path-tracing Time.png b/img/graphs/Effect of Denoising Step on Total Path-tracing Time.png
new file mode 100644
index 00000000..7db710f0
Binary files /dev/null and b/img/graphs/Effect of Denoising Step on Total Path-tracing Time.png differ
diff --git a/img/graphs/Effect of Filter Size on Denoising Time.png b/img/graphs/Effect of Filter Size on Denoising Time.png
new file mode 100644
index 00000000..fa10f7ab
Binary files /dev/null and b/img/graphs/Effect of Filter Size on Denoising Time.png differ
diff --git a/img/graphs/Effect of Increasing Image Resolution on Denoising Time (block size = 8 x 8).png b/img/graphs/Effect of Increasing Image Resolution on Denoising Time (block size = 8 x 8).png
new file mode 100644
index 00000000..0d6d1d77
Binary files /dev/null and b/img/graphs/Effect of Increasing Image Resolution on Denoising Time (block size = 8 x 8).png differ
diff --git a/img/graphs/Effect of Increasing Image Resolution on Denoising Time (linear scale, block size = 16 x 16).png b/img/graphs/Effect of Increasing Image Resolution on Denoising Time (linear scale, block size = 16 x 16).png
new file mode 100644
index 00000000..32d56543
Binary files /dev/null and b/img/graphs/Effect of Increasing Image Resolution on Denoising Time (linear scale, block size = 16 x 16).png differ
diff --git a/img/graphs/Effect of Increasing Image Resolution on Denoising Time with Variable Block Size.png b/img/graphs/Effect of Increasing Image Resolution on Denoising Time with Variable Block Size.png
new file mode 100644
index 00000000..227b374b
Binary files /dev/null and b/img/graphs/Effect of Increasing Image Resolution on Denoising Time with Variable Block Size.png differ
diff --git a/img/graphs/Performance Gain of A-Trous over Gaussian Kernel (Lower is Better).png b/img/graphs/Performance Gain of A-Trous over Gaussian Kernel (Lower is Better).png
new file mode 100644
index 00000000..50dd4479
Binary files /dev/null and b/img/graphs/Performance Gain of A-Trous over Gaussian Kernel (Lower is Better).png differ
diff --git a/img/metal-with-normal-texture.png b/img/metal-with-normal-texture.png
new file mode 100644
index 00000000..84fd3769
Binary files /dev/null and b/img/metal-with-normal-texture.png differ
diff --git a/img/motorcycle/10-samples-denoised.png b/img/motorcycle/10-samples-denoised.png
new file mode 100644
index 00000000..a23da689
Binary files /dev/null and b/img/motorcycle/10-samples-denoised.png differ
diff --git a/img/motorcycle/10-samples-noisy.png b/img/motorcycle/10-samples-noisy.png
new file mode 100644
index 00000000..e8015c4a
Binary files /dev/null and b/img/motorcycle/10-samples-noisy.png differ
diff --git a/img/motorcycle/100-samples-denoised.png b/img/motorcycle/100-samples-denoised.png
new file mode 100644
index 00000000..bd94ba01
Binary files /dev/null and b/img/motorcycle/100-samples-denoised.png differ
diff --git a/img/motorcycle/100-samples-noisy.png b/img/motorcycle/100-samples-noisy.png
new file mode 100644
index 00000000..f5948c60
Binary files /dev/null and b/img/motorcycle/100-samples-noisy.png differ
diff --git a/img/motorcycle/20-samples-denoised.png b/img/motorcycle/20-samples-denoised.png
new file mode 100644
index 00000000..cc9e4212
Binary files /dev/null and b/img/motorcycle/20-samples-denoised.png differ
diff --git a/img/motorcycle/20-samples-noisy.png b/img/motorcycle/20-samples-noisy.png
new file mode 100644
index 00000000..b4dd6d9e
Binary files /dev/null and b/img/motorcycle/20-samples-noisy.png differ
diff --git a/img/motorcycle/50-samples-denoised.png b/img/motorcycle/50-samples-denoised.png
new file mode 100644
index 00000000..efe67d01
Binary files /dev/null and b/img/motorcycle/50-samples-denoised.png differ
diff --git a/img/motorcycle/50-samples-noisy.png b/img/motorcycle/50-samples-noisy.png
new file mode 100644
index 00000000..79f981be
Binary files /dev/null and b/img/motorcycle/50-samples-noisy.png differ
diff --git a/img/motorcycle/5000-samples-ref.png b/img/motorcycle/5000-samples-ref.png
new file mode 100644
index 00000000..e0f88ebd
Binary files /dev/null and b/img/motorcycle/5000-samples-ref.png differ
diff --git a/img/railing.png b/img/railing.png
new file mode 100644
index 00000000..5092ea3b
Binary files /dev/null and b/img/railing.png differ
diff --git a/scenes/cornell.txt b/scenes/cornell.txt
index 83ff8202..77ad5512 100644
--- a/scenes/cornell.txt
+++ b/scenes/cornell.txt
@@ -52,7 +52,7 @@ EMITTANCE   0
 CAMERA
 RES         800 800
 FOVY        45
-ITERATIONS  5000
+ITERATIONS  10
 DEPTH       8
 FILE        cornell
 EYE         0.0 5 10.5
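The `src/main.cpp` changes in this patch time the render with `std::chrono::system_clock`. For measuring intervals, a monotonic clock is generally the safer choice, since it cannot jump if the system time is adjusted. The following is a minimal host-side sketch, not the project's code; the helper name `secondsTaken` is hypothetical. When the timed work launches CUDA kernels, the callable would also need to end with a synchronization such as `cudaDeviceSynchronize()` so that the interval covers kernel execution rather than just the launch.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not part of the project): run `work` once and return
// the elapsed wall-clock time in seconds, measured with a monotonic clock.
template <typename Work>
double secondsTaken(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();  // the code being measured
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

A call would look like `double t = secondsTaken([&] { /* launch kernels, then synchronize */ });`.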
diff --git a/src/main.cpp b/src/main.cpp
index 4092ae4a..c22018c9 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -1,6 +1,7 @@
 #include "main.h"
 #include "preview.h"
 #include <cstring>
+#include <chrono>
 
 #include "../imgui/imgui.h"
 #include "../imgui/imgui_impl_glfw.h"
@@ -45,6 +46,8 @@ int iteration;
 int width;
 int height;
 
+std::chrono::system_clock::time_point pathtraceStart;
+
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
@@ -99,6 +102,7 @@ int main(int argc, char** argv) {
 
 void saveImage() {
     float samples = iteration;
+
     // output image file
     image img(width, height);
 
@@ -151,6 +155,9 @@ void runCuda() {
 
     if (iteration == 0) {
         pathtraceFree();
+
+        pathtraceStart = std::chrono::system_clock::now(); // start timing pathtracer from first iter
+
         pathtraceInit(scene);
     }
 
@@ -171,6 +178,44 @@ void runCuda() {
       showImage(pbo_dptr, iteration);
     }
 
+    // only denoise at last iteration
+#if MEASURE_DENOISE_PERF
+    if (iteration == ui_iterations) {
+
+      auto pathtraceEnd = std::chrono::system_clock::now(); // includes g-buffer runtime
+      std::chrono::duration<double> pathtraceTime = pathtraceEnd - pathtraceStart;
+      std::cout << "Total path-trace run-time (seconds): " << pathtraceTime.count() << std::endl;
+
+      auto start = std::chrono::system_clock::now();
+
+#if GAUSSIAN_KERNEL
+      denoiseGaussianAndWriteToPbo(pbo_dptr, iteration, ui_colorWeight, ui_normalWeight, ui_positionWeight);
+#else
+      denoiseAndWriteToPbo(pbo_dptr, iteration, ui_filterSize, ui_colorWeight, ui_normalWeight, ui_positionWeight, glm::ivec2(16, 16));
+#endif
+
+      auto end = std::chrono::system_clock::now();
+      std::chrono::duration<double> elapsed_seconds = end - start;
+      std::cout << "Denoise run-time (seconds): " << elapsed_seconds.count() << std::endl;
+
+      std::cout << "Fraction of time spent on denoising: " << elapsed_seconds.count() / (elapsed_seconds.count() + pathtraceTime.count()) << std::endl;
+
+      std::cout << std::endl;
+
+      pathtraceFree();
+      cudaDeviceReset();
+      exit(EXIT_SUCCESS);
+    }
+#else
+    if (ui_denoise) {
+#if GAUSSIAN_KERNEL
+      denoiseGaussianAndWriteToPbo(pbo_dptr, iteration, ui_colorWeight, ui_normalWeight, ui_positionWeight);
+#else
+      denoiseAndWriteToPbo(pbo_dptr, iteration, ui_filterSize, ui_colorWeight, ui_normalWeight, ui_positionWeight, glm::ivec2(8, 8));
+#endif
+    }
+#endif
+
     // unmap buffer object
     cudaGLUnmapBufferObject(pbo);
 
diff --git a/src/pathtrace.cu b/src/pathtrace.cu
index 23e5f909..a53820c6 100644
--- a/src/pathtrace.cu
+++ b/src/pathtrace.cu
@@ -4,6 +4,7 @@
 #include <thrust/execution_policy.h>
 #include <thrust/random.h>
 #include <thrust/remove.h>
+#include <chrono>
 
 #include "sceneStructs.h"
 #include "scene.h"
@@ -14,6 +15,9 @@
 #include "intersections.h"
 #include "interactions.h"
 
+#define SHOW_GBUFFER_NORMALS 0
+#define SHOW_GBUFFER_POS 1
+
 #define ERRORCHECK 1
 
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
@@ -67,18 +71,24 @@ __global__ void sendImageToPBO(uchar4* pbo, glm::ivec2 resolution,
     }
 }
 
+__device__ uchar4 vec3ToColor(glm::vec3 v) {
+  glm::vec3 col = glm::clamp(glm::abs(256.f * v), 0.f, 255.f);
+  return make_uchar4(col.x, col.y, col.z, 0);
+}
+
 __global__ void gbufferToPBO(uchar4* pbo, glm::ivec2 resolution, GBufferPixel* gBuffer) {
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
 
     if (x < resolution.x && y < resolution.y) {
         int index = x + (y * resolution.x);
-        float timeToIntersect = gBuffer[index].t * 256.0;
 
-        pbo[index].w = 0;
-        pbo[index].x = timeToIntersect;
-        pbo[index].y = timeToIntersect;
-        pbo[index].z = timeToIntersect;
+#if SHOW_GBUFFER_NORMALS
+        pbo[index] = vec3ToColor(gBuffer[index].normal);
+#elif SHOW_GBUFFER_POS
+        // scale down positions
+        pbo[index] = vec3ToColor(gBuffer[index].position * 0.1f);
+#endif
     }
 }
 
@@ -91,6 +101,11 @@ static ShadeableIntersection * dev_intersections = NULL;
 static GBufferPixel* dev_gBuffer = NULL;
 // TODO: static variables for device memory, any extra info you need, etc
 // ...
+static glm::vec3* dev_image_denoised_in = NULL; // ping pong
+static glm::vec3* dev_image_denoised_out = NULL;
+static glm::ivec2 *dev_offset = NULL;
+static float *dev_kernel = NULL;
+static float* dev_gaussian = NULL;
 
 void pathtraceInit(Scene *scene) {
     hst_scene = scene;
@@ -114,6 +129,51 @@ void pathtraceInit(Scene *scene) {
     cudaMalloc(&dev_gBuffer, pixelcount * sizeof(GBufferPixel));
 
     // TODO: initialize any extra device memeory you need
+    cudaMalloc(&dev_image_denoised_in, pixelcount * sizeof(glm::vec3));
+    cudaMemset(dev_image_denoised_in, 0, pixelcount * sizeof(glm::vec3));
+
+    cudaMalloc(&dev_image_denoised_out, pixelcount * sizeof(glm::vec3));
+    cudaMemset(dev_image_denoised_out, 0, pixelcount * sizeof(glm::vec3));
+
+    cudaMalloc(&dev_offset, 25 * sizeof(glm::ivec2));
+    glm::ivec2 offset[25];
+    for (int i = 0, y = 0; y < 5; ++y) { // read array from left to right, top to bottom
+      for (int x = 0; x < 5; ++x) {
+        offset[i++] = glm::ivec2(x - 2, y - 2);
+      }
+    }
+    cudaMemcpy(dev_offset, offset, 25 * sizeof(glm::ivec2), cudaMemcpyHostToDevice);
+
+    cudaMalloc(&dev_kernel, 25 * sizeof(float));
+    float kernel[25] =
+    { 1.f / 256, 1.f / 64, 3.f / 128, 1.f / 64, 1.f / 256,
+      1.f / 64, 1.f / 16, 3.f / 32, 1.f / 16, 1.f / 64,
+      3.f / 128, 3.f / 32, 9.f / 64, 3.f / 32, 3.f / 128,
+      1.f / 64, 1.f / 16, 3.f / 32, 1.f / 16, 1.f / 64,
+      1.f / 256, 1.f / 64, 3.f / 128, 1.f / 64, 1.f / 256 };
+    cudaMemcpy(dev_kernel, kernel, 25 * sizeof(float), cudaMemcpyHostToDevice);
+
+    cudaMalloc(&dev_gaussian, 121 * sizeof(float));
+
+    // I typed this once before in cis 460 and I will never type this again in my life
+    double gaussian[121] = {
+      0.006849,	0.007239,	0.007559,	0.007795,	0.007941,	0.00799 ,       0.007941,	0.007795,	0.007559,	0.007239,	0.006849,
+      0.007239,	0.007653,	0.00799 ,        0.00824,       0.008394,	0.008446,	0.008394,	0.00824 ,        0.00799,       0.007653,	0.007239,
+      0.007559,	0.00799	,       0.008342,	0.008604,	0.008764,	0.008819,	0.008764,	0.008604,	0.008342,	0.00799 ,        0.007559,
+      0.007795,	0.00824	,       0.008604,	0.008873,	0.009039,	0.009095,	0.009039,	0.008873,	0.008604,	0.00824 ,        0.007795,
+      0.007941,	0.008394,	0.008764,	0.009039,	0.009208,	0.009265,	0.009208,	0.009039,	0.008764,	0.008394,	0.007941,
+      0.00799 ,   0.008446,	0.008819,	0.009095,	0.009265,	0.009322,	0.009265,	0.009095,	0.008819,	0.008446,	0.00799 ,
+      0.007941,	0.008394,	0.008764,	0.009039,	0.009208,	0.009265,	0.009208,	0.009039,	0.008764,	0.008394,	0.007941,
+      0.007795,	0.00824	,       0.008604,	0.008873,	0.009039,	0.009095,	0.009039,	0.008873,	0.008604,	0.00824 ,        0.007795,
+      0.007559,	0.00799	,       0.008342,	0.008604,	0.008764,	0.008819,	0.008764,	0.008604,	0.008342,	0.00799 ,        0.007559,
+      0.007239,	0.007653,	0.00799 ,        0.00824,       0.008394,	0.008446,	0.008394,	0.00824 ,        0.00799,       0.007653,	0.007239,
+      0.006849,	0.007239,	0.007559,	0.007795,	0.007941,	0.00799 ,       0.007941,	0.007795,	0.007559,	0.007239,	0.006849
+    };
+    float gaussian_float[121];
+    for (int i = 0; i < 121; ++i) {
+      gaussian_float[i] = gaussian[i];
+    }
+    cudaMemcpy(dev_gaussian, gaussian_float, 121 * sizeof(float), cudaMemcpyHostToDevice);
 
     checkCUDAError("pathtraceInit");
 }
@@ -126,6 +186,11 @@ void pathtraceFree() {
   	cudaFree(dev_intersections);
     cudaFree(dev_gBuffer);
     // TODO: clean up any extra device memory you created
+    cudaFree(dev_image_denoised_in);
+    cudaFree(dev_image_denoised_out);
+    cudaFree(dev_offset);
+    cudaFree(dev_kernel);
+    cudaFree(dev_gaussian);
 
     checkCUDAError("pathtraceFree");
 }
@@ -281,7 +346,17 @@ __global__ void generateGBuffer (
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < num_paths)
   {
-    gBuffer[idx].t = shadeableIntersections[idx].t;
+    auto& intersect = shadeableIntersections[idx];
+    gBuffer[idx].normal = intersect.surfaceNormal;
+
+    if (intersect.t < 0) {
+      // Position doesn't matter too much since the colour is black anyway
+      gBuffer[idx].position = glm::vec3(0);
+    }
+    else {
+      auto& ray = pathSegments[idx].ray;
+      gBuffer[idx].position = ray.origin + ray.direction * intersect.t;
+    }
   }
 }
 
@@ -378,9 +453,21 @@ void pathtrace(int frame, int iter) {
 	checkCUDAError("trace one bounce");
 	cudaDeviceSynchronize();
 
+#if MEASURE_DENOISE_PERF
+  if (depth == 0 && iter == 1) {
+    auto start = std::chrono::system_clock::now();
+
+    generateGBuffer << <numblocksPathSegmentTracing, blockSize1d >> > (num_paths, dev_intersections, dev_paths, dev_gBuffer);
+    cudaDeviceSynchronize(); // kernel launches are asynchronous; wait so the timer covers execution
+    auto end = std::chrono::system_clock::now();
+    std::chrono::duration<double> elapsed_seconds = end - start;
+    std::cout << "G-buffer generation run-time (seconds): " << elapsed_seconds.count() << std::endl;
+  }
+#else
   if (depth == 0) {
-    generateGBuffer<<<numblocksPathSegmentTracing, blockSize1d>>>(num_paths, dev_intersections, dev_paths, dev_gBuffer);
+    generateGBuffer << <numblocksPathSegmentTracing, blockSize1d >> > (num_paths, dev_intersections, dev_paths, dev_gBuffer);
   }
+#endif
 
 	depth++;
 
@@ -431,3 +518,206 @@ const Camera &cam = hst_scene->state.camera;
     // Send results to OpenGL buffer for rendering
     sendImageToPBO<<<blocksPerGrid2d, blockSize2d>>>(pbo, cam.resolution, iter, dev_image);
 }
+
+__global__ void kernInitDenoiseBuffer(glm::vec3* image, glm::ivec2 resolution, float pathtraceIter, glm::vec3* image_denoised) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  if (!(x < resolution.x && y < resolution.y)) {
+    return;
+  }
+  int index = x + (y * resolution.x);
+  image_denoised[index] = image[index] / pathtraceIter;
+}
+
+__device__ __forceinline__ float getWeight(glm::vec3 v1, glm::vec3 v2, float sigma) {
+  glm::vec3 t = v1 - v2;
+  float dist_squared = glm::dot(t, t);
+  return glm::min(exp(-dist_squared / (sigma * sigma)), 1.0f);
+}
+
+__global__ void kernDenoise(
+  glm::ivec2 resolution,
+  GBufferPixel *gBuffer, 
+  int stepWidth,
+  float *kernel,
+  glm::ivec2 *offset,
+  float colorWeight,
+  float normalWeight,
+  float positionWeight,
+  glm::vec3 *image_denoised_in,
+  glm::vec3 *image_denoised_out
+) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+  if (x >= resolution.x || y >= resolution.y) {
+    return;
+  }
+
+  int index = x + (y * resolution.x);
+
+  auto &color = image_denoised_in[index];
+  auto& position = gBuffer[index].position;
+  auto& normal = gBuffer[index].normal;
+
+  float cum_w = 0.0f;
+  glm::vec3 sum(0.f);
+
+  for (int i = 0; i < 25; ++i) {
+    glm::ivec2 neighbourIdx = glm::ivec2(x, y) + offset[i] * stepWidth;
+
+    if (neighbourIdx.x >= 0 && neighbourIdx.x < resolution.x
+      && neighbourIdx.y >= 0 && neighbourIdx.y < resolution.y) {
+
+      int n = neighbourIdx.x + (neighbourIdx.y * resolution.x);
+
+      auto& neighbourColor = image_denoised_in[n];
+      auto& neighbourPos = gBuffer[n].position;
+      auto& neighbourNorm = gBuffer[n].normal;
+
+      float c_w = getWeight(color, neighbourColor, colorWeight);
+      float p_w = getWeight(position, neighbourPos, positionWeight);
+      float n_w = getWeight(normal, neighbourNorm, normalWeight);
+
+      float weight = c_w * n_w * p_w;
+      sum += kernel[i] * weight * neighbourColor;
+      cum_w += kernel[i] * weight;
+    }
+  }
+
+  image_denoised_out[index] = sum / cum_w;
+}
+
+void denoiseAndWriteToPbo(
+  uchar4* pbo,
+  int pathtraceIter,
+  int filterSize,
+  float colorWeight,
+  float normalWeight,
+  float positionWeight,
+  glm::ivec2 blockSize
+) {
+  const Camera& cam = hst_scene->state.camera;
+  const dim3 blockSize2d(blockSize.x, blockSize.y);
+  const dim3 blocksPerGrid2d(
+    (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
+    (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
+
+  kernInitDenoiseBuffer << <blocksPerGrid2d, blockSize2d >> > (dev_image, cam.resolution, pathtraceIter, dev_image_denoised_in);
+
+  // filter size is size of window on the last iteration
+  int numDenoiseIters = glm::log2(filterSize / 5);
+  int stepWidth = 1;
+
+  for (int i = 0; i < numDenoiseIters; ++i) {
+    kernDenoise << <blocksPerGrid2d, blockSize2d >> > (
+      cam.resolution,
+      dev_gBuffer,
+      stepWidth,
+      dev_kernel,
+      dev_offset,
+      colorWeight,
+      normalWeight,
+      positionWeight,
+      dev_image_denoised_in,
+      dev_image_denoised_out);
+
+    // step width (spacing between taps) doubles every iteration
+    stepWidth <<= 1;
+    // At each pass the paper sets sigma_rt^i = 2^{-i} * sigma_rt, i.e. sigma halves per pass,
+    // allowing for smaller illumination variations to be smoothed
+    colorWeight *= 0.5f;
+
+    std::swap(dev_image_denoised_in, dev_image_denoised_out); // the latest result is now in _in
+  }
+#if !MEASURE_DENOISE_PERF
+  sendImageToPBO << <blocksPerGrid2d, blockSize2d >> > (pbo, cam.resolution, 1, dev_image_denoised_in);
+
+  cudaMemcpy(hst_scene->state.image.data(), dev_image_denoised_in,
+    cam.resolution.x * cam.resolution.y * sizeof(glm::vec3), cudaMemcpyDeviceToHost);
+#endif
+  cudaDeviceSynchronize();
+}
+
+__global__ void kernDenoiseGaussian(
+  glm::ivec2 resolution,
+  GBufferPixel* gBuffer,
+  float *gaussianKernel,
+  float colorWeight,
+  float normalWeight,
+  float positionWeight,
+  glm::vec3* image_denoised_in,
+  glm::vec3* image_denoised_out
+) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+  if (x >= resolution.x || y >= resolution.y) {
+    return;
+  }
+
+  int index = x + (y * resolution.x);
+
+  auto& color = image_denoised_in[index];
+  auto& position = gBuffer[index].position;
+  auto& normal = gBuffer[index].normal;
+
+  float cum_w = 0.0f;
+  glm::vec3 sum(0.f);
+
+  for (int i = glm::max(0, x - 5); i <= glm::min(resolution.x - 1, x + 5); i++) {
+    for (int j = glm::max(0, y - 5); j <= glm::min(resolution.y - 1, y + 5); j++) {
+      int n = i + j * resolution.x;
+
+      auto& neighbourColor = image_denoised_in[n];
+      auto& neighbourPos = gBuffer[n].position;
+      auto& neighbourNorm = gBuffer[n].normal;
+
+      float c_w = getWeight(color, neighbourColor, colorWeight);
+      float p_w = getWeight(position, neighbourPos, positionWeight);
+      float n_w = getWeight(normal, neighbourNorm, normalWeight);
+
+      float weight = c_w * n_w * p_w;
+
+      // i - x and j - y each range over [-5, 5];
+      // adding 5 shifts them to [0, 10], the column/row indices
+      // of the row-major 11x11 Gaussian kernel
+      int gaussianIdx = (i - x + 5) + (j - y + 5) * 11;
+
+      sum += gaussianKernel[gaussianIdx] * weight * neighbourColor;
+      cum_w += gaussianKernel[gaussianIdx] * weight;
+    }
+  }
+
+  image_denoised_out[index] = sum / cum_w;
+}
+
+void denoiseGaussianAndWriteToPbo(
+  uchar4* pbo,
+  int pathtraceIter,
+  float colorWeight,
+  float normalWeight,
+  float positionWeight
+) {
+  const Camera& cam = hst_scene->state.camera;
+  const dim3 blockSize2d(16, 16);
+  const dim3 blocksPerGrid2d(
+    (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
+    (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
+
+  kernInitDenoiseBuffer << <blocksPerGrid2d, blockSize2d >> > (dev_image, cam.resolution, pathtraceIter, dev_image_denoised_in);
+
+  cudaDeviceSynchronize();
+
+  kernDenoiseGaussian << <blocksPerGrid2d, blockSize2d >> > (cam.resolution, dev_gBuffer, dev_gaussian,
+    colorWeight, normalWeight, positionWeight, dev_image_denoised_in, dev_image_denoised_out);
+
+#if !MEASURE_DENOISE_PERF
+  sendImageToPBO << <blocksPerGrid2d, blockSize2d >> > (pbo, cam.resolution, 1, dev_image_denoised_out);
+
+  cudaMemcpy(hst_scene->state.image.data(), dev_image_denoised_out,
+    cam.resolution.x * cam.resolution.y * sizeof(glm::vec3), cudaMemcpyDeviceToHost);
+#endif
+
+  cudaDeviceSynchronize();
+}
\ No newline at end of file
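
Review note on the À-Trous loop in `denoiseAndWriteToPbo`: the comments there describe a schedule in which the tap spacing doubles on every pass and the colour sigma follows `2^{-i} * sigma_rt`. The host-only sketch below writes that schedule out so it can be checked without a GPU. `AtrousPass` and `buildAtrousSchedule` are hypothetical names that do not appear in this diff, and the sketch assumes `filterSize = 5 * 2^numPasses` as stated in the README.

```cpp
#include <vector>

// One À-Trous pass: tap spacing in pixels and the colour sigma used for that pass.
struct AtrousPass {
  int stepWidth;
  float colorSigma;
};

// Hypothetical helper (not part of this diff): builds the per-pass schedule,
// assuming filterSize = 5 * 2^numPasses.
std::vector<AtrousPass> buildAtrousSchedule(int filterSize, float colorSigma) {
  std::vector<AtrousPass> passes;
  int stepWidth = 1;
  // Integer form of log2(filterSize / 5): count how often the 5-wide window can double.
  for (int window = 5; window * 2 <= filterSize; window *= 2) {
    passes.push_back({stepWidth, colorSigma});
    stepWidth *= 2;      // spacing between taps doubles every pass
    colorSigma *= 0.5f;  // sigma_rt^i = 2^{-i} * sigma_rt
  }
  return passes;
}
```

Calling `buildAtrousSchedule(80, 4.0f)` yields four passes with step widths 1, 2, 4, 8 and colour sigmas 4, 2, 1, 0.5. A `filterSize` below 10 yields no passes, which also sidesteps taking the logarithm of zero for very small filter sizes.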
diff --git a/src/pathtrace.h b/src/pathtrace.h
index 9e12f440..f53d8fa8 100644
--- a/src/pathtrace.h
+++ b/src/pathtrace.h
@@ -8,3 +8,11 @@ void pathtraceFree();
 void pathtrace(int frame, int iteration);
 void showGBuffer(uchar4 *pbo);
 void showImage(uchar4 *pbo, int iter);
+void denoiseAndWriteToPbo(uchar4* pbo, int iteration, int filterSize, float colorWeight, float normalWeight, float positionWeight, glm::ivec2 blockSize);
+void denoiseGaussianAndWriteToPbo(
+  uchar4* pbo,
+  int pathtraceIter,
+  float colorWeight,
+  float normalWeight,
+  float positionWeight
+);
\ No newline at end of file
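
Review note on `denoiseGaussianAndWriteToPbo`: the kernel it launches indexes an 11x11 weight table as `(i - x + 5) + (j - y + 5) * 11`, but the code that fills `dev_gaussian` is not visible in this part of the diff. The sketch below shows one way such a table could be built on the host in the same row-major layout. `buildGaussianKernel` is a hypothetical helper and `sigma` is an assumed parameter; neither is taken from the diff.

```cpp
#include <cmath>
#include <vector>

// Hypothetical host-side helper (not part of this diff): builds a normalized
// (2 * radius + 1) x (2 * radius + 1) Gaussian kernel in row-major order.
// sigma is an assumption; the diff does not show which value the project uses.
std::vector<float> buildGaussianKernel(int radius, float sigma) {
  const int width = 2 * radius + 1;
  std::vector<float> kernel(width * width);
  float total = 0.0f;
  for (int dy = -radius; dy <= radius; ++dy) {
    for (int dx = -radius; dx <= radius; ++dx) {
      const float w = std::exp(-(dx * dx + dy * dy) / (2.0f * sigma * sigma));
      // Same mapping as the device code: offsets in [-radius, radius] shift to [0, 2 * radius].
      kernel[(dx + radius) + (dy + radius) * width] = w;
      total += w;
    }
  }
  // Normalize so the weights sum to 1.
  for (float& w : kernel) {
    w /= total;
  }
  return kernel;
}
```

With `radius = 5` the result has 121 entries and could be copied to the device with `cudaMemcpy`. Normalizing on the host is optional here, since the kernel already divides by the accumulated weight.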
diff --git a/src/sceneStructs.h b/src/sceneStructs.h
index da7e558a..ceb5456a 100644
--- a/src/sceneStructs.h
+++ b/src/sceneStructs.h
@@ -6,6 +6,8 @@
 #include "glm/glm.hpp"
 
 #define BACKGROUND_COLOR (glm::vec3(0.0f))
+#define MEASURE_DENOISE_PERF 1 // when 1, skip the PBO upload and host copy so only the kernels are timed
+#define GAUSSIAN_KERNEL 0
 
 enum GeomType {
     SPHERE,
@@ -78,5 +80,6 @@ struct ShadeableIntersection {
 // CHECKITOUT - a simple struct for storing scene geometry information per-pixel.
 // What information might be helpful for guiding a denoising filter?
 struct GBufferPixel {
-  float t;
+  glm::vec3 normal;
+  glm::vec3 position; // TODO: store the t value instead and reconstruct the position from the camera ray
 };
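
Review note on the TODO in `GBufferPixel`: storing the ray parameter `t` instead of a full position would cut the per-pixel position data from three floats to one, at the cost of regenerating the camera ray when the denoiser runs. A minimal sketch of the reconstruction follows. `Vec3` and `reconstructPosition` are hypothetical stand-ins (the project itself uses `glm::vec3`), and the sketch assumes the ray origin and direction passed in are the same ones that produced `t`.

```cpp
// Minimal stand-in for glm::vec3 so the sketch compiles on its own.
struct Vec3 {
  float x, y, z;
};

// Hypothetical helper (not part of this diff): recovers the world-space hit
// point from the ray parameter t, i.e. position = origin + t * direction.
// Assumes rayDirection is the direction that was used when t was computed.
Vec3 reconstructPosition(const Vec3& rayOrigin, const Vec3& rayDirection, float t) {
  return {
    rayOrigin.x + t * rayDirection.x,
    rayOrigin.y + t * rayDirection.y,
    rayOrigin.z + t * rayDirection.z,
  };
}
```

For example, a ray starting at (0, 0, 5) and travelling along (0, 0, -1) with `t = 3` reconstructs to (0, 0, 2).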