diff --git a/README.md b/README.md index f044c821..9c3e32e3 100644 --- a/README.md +++ b/README.md @@ -3,11 +3,73 @@ CUDA Denoiser For CUDA Path Tracer **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* XiaoyuDu
+* Tested on: Windows 10, i9-11900KF @ 3.50GHz, RTX 3080 (Personal PC)
-### (TODO: Your README)
+
+### Description
+This project adds a denoiser to a GPU-based path tracer. The denoising algorithm follows the method proposed in the paper "Edge-Avoiding À-Trous Wavelet Transform for Fast Global Illumination Filtering" by Holger Dammertz, Daniel Sewtz, Johannes Hanika, and Hendrik P. A. Lensch. A sketch of the per-pixel weighting is shown below.
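+
+The filter repeatedly convolves the noisy image with a small 5x5 kernel whose taps are spread farther apart at every pass, and each tap is attenuated by edge-stopping weights computed from the color buffer and the G-buffer. The snippet below is a minimal sketch of how a single neighboring tap is weighted, mirroring the logic in `kernDenoise` (`src/pathtrace.cu`); the function and parameter names are illustrative only, and the real kernel additionally multiplies in the 5x5 kernel weight and normalizes by the accumulated weight.
+
+```cuda
+#include <glm/glm.hpp>
+
+// Edge-avoiding weight for one neighboring tap, following the paper's
+// edge-stopping functions. cPhi/nPhi/pPhi correspond to the GUI
+// color/normal/position weights.
+__device__ float edgeAvoidingWeight(
+    glm::vec3 cCenter, glm::vec3 cTap,   // input colors of the center pixel and the tap
+    glm::vec3 nCenter, glm::vec3 nTap,   // G-buffer normals
+    glm::vec3 pCenter, glm::vec3 pTap,   // G-buffer world positions
+    float cPhi, float nPhi, float pPhi)
+{
+    glm::vec3 d = cCenter - cTap;
+    float cW = glm::min(glm::exp(-glm::dot(d, d) / cPhi), 1.f);
+
+    d = nCenter - nTap;
+    float nW = glm::min(glm::exp(-glm::dot(d, d) / nPhi), 1.f);
+
+    d = pCenter - pTap;
+    float pW = glm::min(glm::exp(-glm::dot(d, d) / pPhi), 1.f);
+
+    // Filtered color = sum(kernel[i] * w_i * c_i) / sum(kernel[i] * w_i).
+    return cW * nW * pW;
+}
+```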
+
+### Features
+* I implemented all the features for Part 2.
+* I implemented a G-buffer optimization that stores the position as one float (z-depth) instead of three floats (glm::vec3).
+* I implemented a complete 7x7 Gaussian kernel and a GUI checkbox to switch between the À-Trous kernel and the Gaussian kernel.
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+### Analysis
+* Runtime for Denoiser
+I tested the denoiser with 10 iterations of the Cornell box scene. On average, adding the denoiser increased the runtime of each iteration by about 1.11242 ms.
+* Influence on the number of iterations
+I tested the influence of the denoiser with a filter size of 20, a color weight of 12, a normal weight of 0.02, and a position weight of 0.34. The results are shown below. Without the denoiser, around 500 iterations are needed to reach an "acceptably smooth" result; with the denoiser, around 60 iterations are enough.
+| Denoiser on | Denoiser off |
+| ----------------- | ----------------- |
+| 1 iteration | 30 iterations |
+| ![](images/cornell_ceiling/1_d.png) | ![](images/cornell_ceiling/29.png) |
+| 6 iterations | 100 iterations |
+| ![](images/cornell_ceiling/6_d.png) | ![](images/cornell_ceiling/99.png) |
+| 60 iterations | 500 iterations |
+| ![](images/cornell_ceiling/60_d.png) | ![](images/cornell_ceiling/500.png) |
+
+* Impact of Different Resolutions on Runtime
+The chart below compares runtime across different resolutions. Runtime is measured in milliseconds over 50 iterations, with resolutions from 1000x1000 to 2500x2500. Note that the step between the tested resolutions is not constant, so the difference in runtime between neighboring data points is not constant either; the overall trend, however, suggests that runtime grows linearly with the number of pixels.
+![](images/runtime_vs_resolution.png)
+
+* Impact of filter sizes on Runtime
+The impact of filter size on runtime was tested at a resolution of 1000x1000, again measured in milliseconds over 50 iterations, with filter sizes from 10x10 to 80x80. In this implementation the filter size determines how many À-Trous passes are run (ceil(log2(filterSize / 2))), so larger filters perform more passes per frame. I expected runtime to increase with filter size, and the results generally show this trend, along with a roughly linear relationship to the filter size.
+![](images/filtersize_vs_runtime.png)
+
+* Impact of filter sizes on Image
+Generally speaking, image quality improves as the filter size increases. In the example below, I tested different filter sizes with 10 iterations, a color weight of 3.9, a normal weight of 0.01, and a position weight of 0.025. As the results show, image quality looks better with a larger filter size. Above a filter size of 40, I can no longer see any difference between the resulting images.
+
+| FilterSize = 10 | FilterSize = 20 | FilterSize = 40 |
+| ----------------- | ----------------- | --- |
+| ![](images/cornell_ceiling/filtersize/10.png) | ![](images/cornell_ceiling/filtersize/20.png) | ![](images/cornell_ceiling/filtersize/40.png) |
+
+* Effectiveness of this method on different materials
+I tested the effectiveness of this method on Lambertian, ideal reflective, and ideal refractive materials. I changed the color of all materials to yellow and tested each scene with 50 iterations, a filter size of 20, a color weight of 3.9, a normal weight of 0.01, and a position weight of 0.025. The results are shown below. Based on these results, the method works well on all three materials.
+
+| | Denoiser on | Denoiser off |
+| --- | ----------------- | ----------------- |
+| Lambert | ![](images/cornell_ceiling_2/d_lambert.png) | ![](images/cornell_ceiling_2/lambert.png) |
+| Ideal reflective | ![](images/cornell_ceiling_2/d_reflective.png) | ![](images/cornell_ceiling_2/reflective.png) |
+| Ideal refractive | ![](images/cornell_ceiling_2/d_refractive.png) | ![](images/cornell_ceiling_2/refractive.png) |
+
+* Results across Different Scenes
+I compared the "cornell.txt" scene, the standard Cornell box, with the "cornell_ceiling_light.txt" scene, which is the same box with a much larger ceiling light. Both scenes were tested with 50 iterations, a filter size of 20, a color weight of 10, a normal weight of 0.01, and a position weight of 0.025. I chose a color weight of 10 here because a higher color weight is needed to produce a good result for the "cornell.txt" scene. The results are shown below. The scene with the larger light converges to a much better result than the scene with the standard light. The reason is that with a larger light, each ray is more likely to hit the light source and therefore more likely to contribute useful, correct color information to its pixel. In the standard Cornell box scene, many rays miss the light source for many iterations and leave their pixels black, so the denoiser has little useful information to average. A better solution would be to replace the current naive integrator with a full-lighting integrator that actively samples the light sources.
+
+| standard cornell box scene | cornell box scene with larger light |
+| ------ | -- |
+| ![](images/cornell_ceiling_1and2/cornell.png) | ![](images/cornell_ceiling_1and2/cornell_ceiling.png) |
+
+### Extra Features
+* G-buffer Optimization
+I implemented a G-buffer optimization that stores the z-depth instead of a glm::vec3 position in the G-buffer. On the GUI, I added a checkbox called "Using Z for position": check it to switch to z-depth mode and uncheck it to switch back to vec3 mode. I compared the two modes with 50 iterations at resolutions from 1000x1000 to 3000x3000; the results are shown below. The runtime of z-depth mode is slightly lower than that of vec3 mode at every resolution, and it also appears to scale linearly with the number of pixels. A sketch of how the position is reconstructed from the stored depth follows the chart.
+![](images/zdepth_vs_vec3.png)
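+
+The sketch below shows how a world-space position can be recovered from the stored z-depth by re-deriving the primary camera ray for a pixel, in the spirit of `getPosFromZ` in `src/pathtrace.cu`. It relies on the project's `Camera` struct from `sceneStructs.h`; the function name is illustrative.
+
+```cuda
+#include <glm/glm.hpp>
+#include "sceneStructs.h"   // Camera (position, view, right, up, pixelLength, resolution)
+
+// Rebuild the world-space hit position of pixel (x, y) from the stored
+// ray parameter t (the "z-depth"), using the same camera-ray construction
+// as the ray generation kernel.
+__host__ __device__ glm::vec3 positionFromDepth(const Camera& cam, int x, int y, float t)
+{
+    glm::vec3 dir = glm::normalize(cam.view
+        - cam.right * cam.pixelLength.x * ((float)x - (float)cam.resolution.x * 0.5f)
+        - cam.up * cam.pixelLength.y * ((float)y - (float)cam.resolution.y * 0.5f));
+    return cam.position + t * dir;   // hit position = ray origin + t * ray direction
+}
+```
+
+Storing one float per pixel instead of three trades a little extra arithmetic in the filter for a third of the position bandwidth in the G-buffer, which is consistent with the small runtime win measured above.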
+
+* Gaussian kernel
+I also implemented a complete 7x7 Gaussian kernel. Since the Gaussian weights differ for every kernel size, I chose a single size, 7x7, to implement. On the GUI, I added a checkbox called "Gaussian" that switches between the Gaussian kernel and the À-Trous kernel. Below is a comparison of the two with a filter size of 7 and 50 iterations. No matter how I adjust the parameters, the results from the Gaussian kernel and the À-Trous kernel look nearly identical, so I think the À-Trous filter is a very good approximation of a Gaussian kernel.
+
+| Gaussian kernel | À-Trous kernel |
+| --- | --- |
+| ![](images/gaussian/gaussian.png) | ![](images/gaussian/no_gaussian.png) |
diff --git a/images/cornell_ceiling/1_d.png b/images/cornell_ceiling/1_d.png new file mode 100644 index 00000000..d65fee46 Binary files /dev/null and b/images/cornell_ceiling/1_d.png differ diff --git a/images/cornell_ceiling/29.png b/images/cornell_ceiling/29.png new file mode 100644 index 00000000..cf60b236 Binary files /dev/null and b/images/cornell_ceiling/29.png differ diff --git a/images/cornell_ceiling/500.png b/images/cornell_ceiling/500.png new file mode 100644 index 00000000..3d63f2a2 Binary files /dev/null and b/images/cornell_ceiling/500.png differ diff --git a/images/cornell_ceiling/60_d.png b/images/cornell_ceiling/60_d.png new file mode 100644 index 00000000..046dc2cc Binary files /dev/null and b/images/cornell_ceiling/60_d.png differ diff --git a/images/cornell_ceiling/6_d.png b/images/cornell_ceiling/6_d.png new file mode 100644 index 00000000..a100265e Binary files /dev/null and b/images/cornell_ceiling/6_d.png differ diff --git a/images/cornell_ceiling/99.png b/images/cornell_ceiling/99.png new file mode 100644 index 00000000..789502dc Binary files /dev/null and b/images/cornell_ceiling/99.png differ diff --git a/images/cornell_ceiling/filtersize/10.png b/images/cornell_ceiling/filtersize/10.png new file mode 100644 index 00000000..79d228f0 Binary files /dev/null and b/images/cornell_ceiling/filtersize/10.png differ diff --git a/images/cornell_ceiling/filtersize/20.png b/images/cornell_ceiling/filtersize/20.png new file mode 100644 index 00000000..60e21fc9 Binary files /dev/null and b/images/cornell_ceiling/filtersize/20.png differ diff --git a/images/cornell_ceiling/filtersize/40.png b/images/cornell_ceiling/filtersize/40.png new file mode 100644 index 00000000..949ebecc Binary files /dev/null and b/images/cornell_ceiling/filtersize/40.png differ diff --git a/images/cornell_ceiling_1and2/cornell.png b/images/cornell_ceiling_1and2/cornell.png new file mode 100644 index 00000000..7be2e323 Binary files /dev/null and b/images/cornell_ceiling_1and2/cornell.png differ diff --git a/images/cornell_ceiling_1and2/cornell_ceiling.png b/images/cornell_ceiling_1and2/cornell_ceiling.png new file mode 100644 index 00000000..85ff9739 Binary files /dev/null and b/images/cornell_ceiling_1and2/cornell_ceiling.png differ diff --git a/images/cornell_ceiling_2/d_lambert.png b/images/cornell_ceiling_2/d_lambert.png new file mode 100644 index 00000000..b70b2070 Binary files /dev/null and b/images/cornell_ceiling_2/d_lambert.png differ diff --git a/images/cornell_ceiling_2/d_reflective.png b/images/cornell_ceiling_2/d_reflective.png new file mode 100644 index 00000000..c6b5489f Binary files /dev/null and b/images/cornell_ceiling_2/d_reflective.png differ diff --git a/images/cornell_ceiling_2/d_refractive.png b/images/cornell_ceiling_2/d_refractive.png new file mode 100644 index 00000000..cf052b53 Binary files /dev/null and b/images/cornell_ceiling_2/d_refractive.png differ diff --git a/images/cornell_ceiling_2/lambert.png 
b/images/cornell_ceiling_2/lambert.png new file mode 100644 index 00000000..c325eeca Binary files /dev/null and b/images/cornell_ceiling_2/lambert.png differ diff --git a/images/cornell_ceiling_2/reflective.png b/images/cornell_ceiling_2/reflective.png new file mode 100644 index 00000000..30fb3c56 Binary files /dev/null and b/images/cornell_ceiling_2/reflective.png differ diff --git a/images/cornell_ceiling_2/refractive.png b/images/cornell_ceiling_2/refractive.png new file mode 100644 index 00000000..315f6e25 Binary files /dev/null and b/images/cornell_ceiling_2/refractive.png differ diff --git a/images/filtersize_vs_runtime.png b/images/filtersize_vs_runtime.png new file mode 100644 index 00000000..a6b24d29 Binary files /dev/null and b/images/filtersize_vs_runtime.png differ diff --git a/images/gaussian/gaussian.png b/images/gaussian/gaussian.png new file mode 100644 index 00000000..8910db66 Binary files /dev/null and b/images/gaussian/gaussian.png differ diff --git a/images/gaussian/no_gaussian.png b/images/gaussian/no_gaussian.png new file mode 100644 index 00000000..0b92e0df Binary files /dev/null and b/images/gaussian/no_gaussian.png differ diff --git a/images/runtime_vs_resolution.png b/images/runtime_vs_resolution.png new file mode 100644 index 00000000..0ecc182f Binary files /dev/null and b/images/runtime_vs_resolution.png differ diff --git a/images/zdepth_vs_vec3.png b/images/zdepth_vs_vec3.png new file mode 100644 index 00000000..e75e1296 Binary files /dev/null and b/images/zdepth_vs_vec3.png differ diff --git a/scenes/cornell.txt b/scenes/cornell.txt index 83ff8202..0d788bc4 100644 --- a/scenes/cornell.txt +++ b/scenes/cornell.txt @@ -50,9 +50,9 @@ EMITTANCE 0 // Camera CAMERA -RES 800 800 +RES 1000 1000 FOVY 45 -ITERATIONS 5000 +ITERATIONS 50 DEPTH 8 FILE cornell EYE 0.0 5 10.5 diff --git a/scenes/cornell_ceiling_light.txt b/scenes/cornell_ceiling_light.txt index 15af5f19..26ed6a0b 100644 --- a/scenes/cornell_ceiling_light.txt +++ b/scenes/cornell_ceiling_light.txt @@ -50,9 +50,9 @@ EMITTANCE 0 // Camera CAMERA -RES 800 800 +RES 1000 1000 FOVY 45 -ITERATIONS 10 +ITERATIONS 50 DEPTH 8 FILE cornell EYE 0.0 5 10.5 diff --git a/scenes/cornell_ceiling_light2.txt b/scenes/cornell_ceiling_light2.txt new file mode 100644 index 00000000..e2562026 --- /dev/null +++ b/scenes/cornell_ceiling_light2.txt @@ -0,0 +1,148 @@ +// Emissive material (light) +MATERIAL 0 +RGB 1 1 1 +SPECEX 0 +SPECRGB 0 0 0 +REFL 0 +REFR 0 +REFRIOR 0 +EMITTANCE 1 + +// Diffuse white +MATERIAL 1 +RGB .98 .98 .98 +SPECEX 0 +SPECRGB 0 0 0 +REFL 0 +REFR 0 +REFRIOR 0 +EMITTANCE 0 + +// Diffuse red +MATERIAL 2 +RGB .85 .35 .35 +SPECEX 0 +SPECRGB 0 0 0 +REFL 0 +REFR 0 +REFRIOR 0 +EMITTANCE 0 + +// Diffuse green +MATERIAL 3 +RGB .35 .85 .35 +SPECEX 0 +SPECRGB 0 0 0 +REFL 0 +REFR 0 +REFRIOR 0 +EMITTANCE 0 + +// reflectance +MATERIAL 4 +RGB .98 .98 .98 +SPECEX 0 +SPECRGB .98 .98 .98 +REFL 1 +REFR 0 +REFRIOR 0 +EMITTANCE 0 + +// yellow refractive +MATERIAL 5 +RGB .98 .98 .8 +SPECEX 0 +SPECRGB .98 .98 .98 +REFL 0 +REFR 1 +REFRIOR 0 +EMITTANCE 0 + +// yellow reflectance +MATERIAL 6 +RGB .98 .98 .8 +SPECEX 0 +SPECRGB .98 .98 .98 +REFL 1 +REFR 0 +REFRIOR 0 +EMITTANCE 0 + +// yellow lambert +MATERIAL 7 +RGB .98 .98 .8 +SPECEX 0 +SPECRGB .98 .98 .98 +REFL 0 +REFR 0 +REFRIOR 0 +EMITTANCE 0 + + +// Camera +CAMERA +RES 1000 1000 +FOVY 45 +ITERATIONS 40 +DEPTH 8 +FILE cornell +EYE 0.0 5 10.5 +LOOKAT 0 5 0 +UP 0 1 0 + + +// Ceiling light +OBJECT 0 +cube +material 0 +TRANS 0 10 0 +ROTAT 0 0 0 +SCALE 10 .3 10 + +// Floor +OBJECT 1 +cube 
+material 1 +TRANS 0 0 0 +ROTAT 0 0 0 +SCALE 10 .01 10 + +// Ceiling +OBJECT 2 +cube +material 1 +TRANS 0 10 0 +ROTAT 0 0 90 +SCALE .01 10 10 + +// Back wall +OBJECT 3 +cube +material 1 +TRANS 0 5 -5 +ROTAT 0 90 0 +SCALE .01 10 10 + +// Left wall +OBJECT 4 +cube +material 2 +TRANS -5 5 0 +ROTAT 0 0 0 +SCALE .01 10 10 + +// Right wall +OBJECT 5 +cube +material 3 +TRANS 5 5 0 +ROTAT 0 0 0 +SCALE .01 10 10 + +// Sphere +OBJECT 6 +sphere +material 5 +TRANS -1 4 -1 +ROTAT 0 0 0 +SCALE 3 3 3 diff --git a/src/interactions.h b/src/interactions.h index 144a9f5b..32f1cf0d 100644 --- a/src/interactions.h +++ b/src/interactions.h @@ -53,7 +53,11 @@ void scatterRay( glm::vec3 newDirection; if (m.hasReflective) { newDirection = glm::reflect(pathSegment.ray.direction, normal); - } else { + } + else if (m.hasRefractive) { + newDirection = pathSegment.ray.direction; + } + else { newDirection = calculateRandomDirectionInHemisphere(normal, rng); } diff --git a/src/main.cpp b/src/main.cpp index 4092ae4a..4b2e2e56 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -23,12 +23,24 @@ int ui_iterations = 0; int startupIterations = 0; int lastLoopIterations = 0; bool ui_showGbuffer = false; -bool ui_denoise = false; -int ui_filterSize = 80; -float ui_colorWeight = 0.45f; -float ui_normalWeight = 0.35f; -float ui_positionWeight = 0.2f; +bool currDenoiser = false; +bool ui_denoiser = false; +int currFilterSize = 20; +int ui_filterSize = 20; +//float currColorWeight = 3.9f; +//float ui_colorWeight = 3.9f; +float currColorWeight = 10.f; +float ui_colorWeight = 10.f; +float currNormalWeight = 0.01f; +float ui_normalWeight = 0.01f; +float currPositionWeight = 0.025f; +float ui_positionWeight = 0.025f; bool ui_saveAndExit = false; +float ui_timer = 0.f; +bool currUseZforPos = false; +bool ui_useZforPos = false; +bool currGaussian = false; +bool ui_gaussian = false; static bool camchanged = true; static float dtheta = 0, dphi = 0; @@ -110,10 +122,15 @@ void saveImage() { } } - std::string filename = renderState->imageName; + /*std::string filename = renderState->imageName; std::ostringstream ss; ss << filename << "." << startTimeString << "." 
<< samples << "samp"; - filename = ss.str(); + filename = ss.str();*/ + + std::string filename = std::to_string(lastLoopIterations); + if (currDenoiser) { + filename += ("_" + std::to_string(currColorWeight) + "_" + std::to_string(currNormalWeight) + "_" + std::to_string(currPositionWeight) + "_"); + } // CHECKITOUT img.savePNG(filename); @@ -126,6 +143,38 @@ void runCuda() { camchanged = true; } + if (currFilterSize != ui_filterSize) { + currFilterSize = ui_filterSize; + camchanged = true; + } + if (currColorWeight != ui_colorWeight) { + currColorWeight = ui_colorWeight; + camchanged = true; + } + if (currNormalWeight != ui_normalWeight) { + currNormalWeight = ui_normalWeight; + camchanged = true; + } + if (currPositionWeight != ui_positionWeight) { + currPositionWeight = ui_positionWeight; + camchanged = true; + } + + if (currDenoiser != ui_denoiser) { + currDenoiser = ui_denoiser; + camchanged = true; + } + + if (currUseZforPos != ui_useZforPos) { + currUseZforPos = ui_useZforPos; + camchanged = true; + } + + if (currGaussian != ui_gaussian) { + currGaussian = ui_gaussian; + camchanged = true; + } + if (camchanged) { iteration = 0; Camera &cam = renderState->camera; @@ -144,25 +193,43 @@ void runCuda() { cameraPosition += cam.lookAt; cam.position = cameraPosition; camchanged = false; + + //cout << cam.position.x << endl; + //cout << cam.position.y << endl; + //cout << cam.position.z << endl; + //cout << cam.lookAt.x << endl; + //cout << cam.lookAt.y << endl; + //cout << cam.lookAt.z << endl; + ui_timer = 0.f; } // Map OpenGL buffer object for writing from CUDA on a single GPU // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer if (iteration == 0) { - pathtraceFree(); - pathtraceInit(scene); + pathtraceFree(currUseZforPos, currGaussian); + pathtraceInit(scene, currUseZforPos, currGaussian); } uchar4 *pbo_dptr = NULL; cudaGLMapBufferObject((void**)&pbo_dptr, pbo); if (iteration < ui_iterations) { - iteration++; + ++iteration; + cudaEvent_t event_start; + cudaEvent_t event_end; + cudaEventCreate(&event_start); + cudaEventCreate(&event_end); + cudaEventRecord(event_start); // execute the kernel int frame = 0; - pathtrace(frame, iteration); + pathtrace(frame, iteration, ui_filterSize, ui_colorWeight, ui_normalWeight, ui_positionWeight, ui_denoiser, currUseZforPos, currGaussian); + cudaEventRecord(event_end); + cudaEventSynchronize(event_end); + float timeElapsedMilliseconds; + cudaEventElapsedTime(&timeElapsedMilliseconds, event_start, event_end); + ui_timer += timeElapsedMilliseconds; } if (ui_showGbuffer) { @@ -176,7 +243,7 @@ void runCuda() { if (ui_saveAndExit) { saveImage(); - pathtraceFree(); + pathtraceFree(currUseZforPos, currGaussian); cudaDeviceReset(); exit(EXIT_SUCCESS); } diff --git a/src/main.h b/src/main.h index 06d311a8..1c2accb3 100644 --- a/src/main.h +++ b/src/main.h @@ -41,6 +41,10 @@ extern float ui_colorWeight; extern float ui_normalWeight; extern float ui_positionWeight; extern bool ui_saveAndExit; +extern bool ui_denoiser; +extern float ui_timer; +extern bool ui_useZforPos; +extern bool ui_gaussian; void runCuda(); void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods); diff --git a/src/pathtrace.cu b/src/pathtrace.cu index 23e5f909..d0e34187 100644 --- a/src/pathtrace.cu +++ b/src/pathtrace.cu @@ -4,6 +4,9 @@ #include #include #include +#include +#include "device_launch_parameters.h" + #include "sceneStructs.h" #include "scene.h" @@ -73,12 +76,26 @@ __global__ void gbufferToPBO(uchar4* pbo, glm::ivec2 
resolution, GBufferPixel* g if (x < resolution.x && y < resolution.y) { int index = x + (y * resolution.x); - float timeToIntersect = gBuffer[index].t * 256.0; + //show intersection t + //float timeToIntersect = gBuffer[index].t * 256.0; + //pbo[index].w = 0; + //pbo[index].x = timeToIntersect; + //pbo[index].y = timeToIntersect; + //pbo[index].z = timeToIntersect; + + //show normal + // note we need to times 255, if times 256, it becomes 0 + /*pbo[index].w = 0; + pbo[index].x = glm::abs(gBuffer[index].normal.x) * 255.f; + pbo[index].y = glm::abs(gBuffer[index].normal.y) * 255.f; + pbo[index].z = glm::abs(gBuffer[index].normal.z) * 255.f;*/ + + //show position pbo[index].w = 0; - pbo[index].x = timeToIntersect; - pbo[index].y = timeToIntersect; - pbo[index].z = timeToIntersect; + pbo[index].x = glm::abs(gBuffer[index].position.x) * 20.f; + pbo[index].y = glm::abs(gBuffer[index].position.y) * 20.f; + pbo[index].z = glm::abs(gBuffer[index].position.z) * 20.f; } } @@ -89,10 +106,14 @@ static Material * dev_materials = NULL; static PathSegment * dev_paths = NULL; static ShadeableIntersection * dev_intersections = NULL; static GBufferPixel* dev_gBuffer = NULL; -// TODO: static variables for device memory, any extra info you need, etc -// ... - -void pathtraceInit(Scene *scene) { +static float* dev_kernel = NULL; +static glm::ivec2* dev_offset = NULL; +static glm::vec3* dev_image2 = NULL; +static GBufferPixelZ * dev_gBufferZ = NULL; +static float* dev_gaussian = NULL; +static glm::ivec2* dev_gaussianOffset = NULL; + +void pathtraceInit(Scene *scene, bool useZforPos, bool gaussian) { hst_scene = scene; const Camera &cam = hst_scene->state.camera; const int pixelcount = cam.resolution.x * cam.resolution.y; @@ -111,22 +132,45 @@ void pathtraceInit(Scene *scene) { cudaMalloc(&dev_intersections, pixelcount * sizeof(ShadeableIntersection)); cudaMemset(dev_intersections, 0, pixelcount * sizeof(ShadeableIntersection)); - cudaMalloc(&dev_gBuffer, pixelcount * sizeof(GBufferPixel)); // TODO: initialize any extra device memeory you need + cudaMalloc(&dev_image2, pixelcount * sizeof(glm::vec3)); + cudaMemset(dev_image2, 0, pixelcount * sizeof(glm::vec3)); + cudaMalloc(&dev_kernel, 25 * sizeof(float)); + cudaMalloc(&dev_offset, 25 * sizeof(glm::ivec2)); + setKernelOffset(dev_kernel, dev_offset); + + if (useZforPos) { + cudaMalloc(&dev_gBufferZ, pixelcount * sizeof(GBufferPixelZ)); + } + else { + cudaMalloc(&dev_gBuffer, pixelcount * sizeof(GBufferPixel)); + } + //just malloc gaussian anyway + cudaMalloc(&dev_gaussianOffset, 49 * sizeof(glm::ivec2)); + cudaMalloc(&dev_gaussian, 49 * sizeof(float)); + setGaussianOffset(dev_gaussian, dev_gaussianOffset); checkCUDAError("pathtraceInit"); } -void pathtraceFree() { +void pathtraceFree(bool useZforPos, bool gaussian) { cudaFree(dev_image); // no-op if dev_image is null cudaFree(dev_paths); cudaFree(dev_geoms); cudaFree(dev_materials); cudaFree(dev_intersections); - cudaFree(dev_gBuffer); // TODO: clean up any extra device memory you created - + cudaFree(dev_kernel); + cudaFree(dev_offset); + cudaFree(dev_image2); + if (useZforPos) { + cudaFree(dev_gBufferZ); + } + else { + cudaFree(dev_gBuffer); + } + cudaFree(dev_gaussian); checkCUDAError("pathtraceFree"); } @@ -140,15 +184,15 @@ void pathtraceFree() { */ __global__ void generateRayFromCamera(Camera cam, int iter, int traceDepth, PathSegment* pathSegments) { - int x = (blockIdx.x * blockDim.x) + threadIdx.x; - int y = (blockIdx.y * blockDim.y) + threadIdx.y; + int x = (blockIdx.x * blockDim.x) + threadIdx.x; 
//pixel index x + int y = (blockIdx.y * blockDim.y) + threadIdx.y; //pixel index y if (x < cam.resolution.x && y < cam.resolution.y) { int index = x + (y * cam.resolution.x); PathSegment & segment = pathSegments[index]; segment.ray.origin = cam.position; - segment.color = glm::vec3(1.0f, 1.0f, 1.0f); + segment.color = glm::vec3(1.0f, 1.0f, 1.0f); segment.ray.direction = glm::normalize(cam.view - cam.right * cam.pixelLength.x * ((float)x - (float)cam.resolution.x * 0.5f) @@ -277,11 +321,32 @@ __global__ void generateGBuffer ( int num_paths, ShadeableIntersection* shadeableIntersections, PathSegment* pathSegments, - GBufferPixel* gBuffer) { + GBufferPixel* gBuffer, + GBufferPixelZ* gBufferZ, + bool useZforPos +) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < num_paths) { - gBuffer[idx].t = shadeableIntersections[idx].t; + if (useZforPos) { + gBufferZ[idx].normal = shadeableIntersections[idx].surfaceNormal; + if (shadeableIntersections[idx].t == -1) { + gBufferZ[idx].z = 0.f; + } + else { + gBufferZ[idx].z = shadeableIntersections[idx].t; + } + } + else { + gBuffer[idx].t = shadeableIntersections[idx].t; + gBuffer[idx].normal = shadeableIntersections[idx].surfaceNormal; + if (gBuffer[idx].t == -1.f) { + gBuffer[idx].position = glm::vec3(0.f); + } + else { + gBuffer[idx].position = pathSegments[idx].ray.origin + gBuffer[idx].t * pathSegments[idx].ray.direction; + } + } } } @@ -297,11 +362,134 @@ __global__ void finalGather(int nPaths, glm::vec3 * image, PathSegment * iterati } } +__host__ __device__ void getPosFromZ( + int pixelX, + int pixelY, + Camera cam, + float z, + glm::vec3& pos +) { + glm::vec3 origin = cam.position; + + glm::vec3 dir = glm::normalize(cam.view + - cam.right * cam.pixelLength.x * ((float)pixelX - (float)cam.resolution.x * 0.5f) + - cam.up * cam.pixelLength.y * ((float)pixelY - (float)cam.resolution.y * 0.5f)); + pos = origin + dir * (z - 0.0000001f); +} + +__global__ void kernDenoise( + int num_paths, + glm::vec3* image, + float* kernel, + glm::ivec2* offset, + GBufferPixel* gBuffers, + int filterSize, + int num_step, + Camera cam, + float c_phi, + float n_phi, + float p_phi, + glm::vec3* image2, + int iteration, + GBufferPixelZ* gBufferZ, + bool useZforPos, + int kernSize +) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (index < num_paths) + { + glm::vec3 sum = glm::vec3(0.f); + glm::vec3 cval = image[index]; + glm::vec3 nval; + glm::vec3 pval; + int pixelY = index / cam.resolution.x; + int pixelX = index - (pixelY * cam.resolution.x); + + if (useZforPos) { + float z = gBufferZ[index].z; + getPosFromZ(pixelX, pixelY, cam, z, pval); + nval = gBufferZ[index].normal; + } + else { + nval = gBuffers[index].normal; + pval = gBuffers[index].position; + } + float cum_w = 0.f; + + //offset: (-2, -2), (-2, -1), (-2, 0), .... 
+ for (int i = 0; i < kernSize; ++i) { + + int stepSize = pow(2, num_step); + glm::ivec2 currOffset = offset[i] * stepSize; + + //keep currPixel inside image + int currPixelX = glm::clamp(currOffset.x + pixelX, 0, cam.resolution.x - 1); + int currPixelY = glm::clamp(currOffset.y + pixelY, 0, cam.resolution.y - 1); + int currIndex = currPixelX + (currPixelY * cam.resolution.x); + + glm::vec3 ctmp = image[currIndex]; + glm::vec3 t = cval - ctmp; + + //increase color weight when iteration is higher + float dist2 = glm::dot(t, t); + float c_w = glm::min(glm::exp(-(dist2) / c_phi), 1.f); + + glm::vec3 ptmp; + glm::vec3 ntmp; + if (useZforPos) { + ntmp = gBufferZ[currIndex].normal; + } + else { + ntmp = gBuffers[currIndex].normal; + } + t = nval - ntmp; + dist2 = glm::max(glm::dot(t, t) / (stepSize * stepSize), 0.f); + float n_w = glm::min(glm::exp(-(dist2) / n_phi), 1.f); + + if (useZforPos) { + float currZ = gBufferZ[currIndex].z; + getPosFromZ(currPixelX, currPixelY, cam, currZ, ptmp); + } + else { + ptmp = gBuffers[currIndex].position; + } + t = pval - ptmp; + dist2 = glm::dot(t, t); + float p_w = glm::min(glm::exp(-(dist2) / p_phi), 1.f); + + float weight = c_w * n_w * p_w; + + sum += ctmp * weight * kernel[i]; + cum_w += weight * kernel[i]; + } + image2[index] = sum / cum_w; + } +} + +struct remainingBounceIsNot0 { + __host__ __device__ + bool operator()(const PathSegment& p1) { + return (p1.remainingBounces > 0); + } +}; + +__global__ void kernImageCopy( + int num_paths, + glm::vec3* image1, + glm::vec3* image2 +) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (index < num_paths) { + image1[index] = image2[index]; + } +} /** * Wrapper for the __global__ call that sets up the kernel calls and does a ton * of memory management */ -void pathtrace(int frame, int iter) { +void pathtrace(int frame, int iter, int filterSize, float c_phi, float n_phi, float p_phi, bool denoiser, bool useZforPos, bool gaussian) { const int traceDepth = hst_scene->state.traceDepth; const Camera &cam = hst_scene->state.camera; const int pixelcount = cam.resolution.x * cam.resolution.y; @@ -352,59 +540,89 @@ void pathtrace(int frame, int iter) { int depth = 0; PathSegment* dev_path_end = dev_paths + pixelcount; int num_paths = dev_path_end - dev_paths; + int curr_num_paths = num_paths; // --- PathSegment Tracing Stage --- // Shoot ray into scene, bounce between objects, push shading chunks - - // Empty gbuffer - cudaMemset(dev_gBuffer, 0, pixelcount * sizeof(GBufferPixel)); - + if (denoiser) { + // Empty gbuffer + cudaMemset(dev_gBuffer, 0, pixelcount * sizeof(GBufferPixel)); + } // clean shading chunks cudaMemset(dev_intersections, 0, pixelcount * sizeof(ShadeableIntersection)); - bool iterationComplete = false; + bool iterationComplete = false; while (!iterationComplete) { - // tracing - dim3 numblocksPathSegmentTracing = (num_paths + blockSize1d - 1) / blockSize1d; - computeIntersections <<>> ( - depth - , num_paths - , dev_paths - , dev_geoms - , hst_scene->geoms.size() - , dev_intersections - ); - checkCUDAError("trace one bounce"); - cudaDeviceSynchronize(); - - if (depth == 0) { - generateGBuffer<<>>(num_paths, dev_intersections, dev_paths, dev_gBuffer); - } - - depth++; - - shadeSimpleMaterials<<>> ( - iter, - num_paths, - dev_intersections, - dev_paths, - dev_materials - ); - iterationComplete = depth == traceDepth; + // tracing + dim3 numblocksPathSegmentTracing = (num_paths + blockSize1d - 1) / blockSize1d; + computeIntersections <<>> ( + depth + , num_paths + , dev_paths + , dev_geoms + , 
hst_scene->geoms.size() + , dev_intersections + ); + checkCUDAError("trace one bounce"); + cudaDeviceSynchronize(); + + if (denoiser) { + if (depth == 0) { + generateGBuffer << > > (num_paths, dev_intersections, dev_paths, dev_gBuffer, dev_gBufferZ, useZforPos); + } + } + + ++depth; + + shadeSimpleMaterials<<>> ( + iter, + num_paths, + dev_intersections, + dev_paths, + dev_materials + ); + + dev_path_end = thrust::partition(thrust::device, dev_paths, dev_paths + curr_num_paths, remainingBounceIsNot0()); + curr_num_paths = dev_path_end - dev_paths; + + iterationComplete = ((depth == traceDepth) || (curr_num_paths == 0)); } - // Assemble this iteration and apply it to the image - dim3 numBlocksPixels = (pixelcount + blockSize1d - 1) / blockSize1d; + // Assemble this iteration and apply it to the image + dim3 numBlocksPixels = (pixelcount + blockSize1d - 1) / blockSize1d; finalGather<<>>(num_paths, dev_image, dev_paths); + if (denoiser) { + //filter size represents one dimension of the filter + int num_steps = ceil(log2(filterSize/2)); + if (num_steps != 0) { + for (int i = 0; i < num_steps; ++i) { + if (gaussian) { + int kernSize = 49; + kernDenoise << > > (num_paths, dev_image, dev_gaussian, dev_gaussianOffset, dev_gBuffer, filterSize, i, cam, c_phi, n_phi, p_phi, dev_image2, iter, dev_gBufferZ, useZforPos, kernSize); + } + else { + int kernSize = 25; + kernDenoise << > > (num_paths, dev_image, dev_kernel, dev_offset, dev_gBuffer, filterSize, i, cam, c_phi, n_phi, p_phi, dev_image2, iter, dev_gBufferZ, useZforPos, kernSize); + } + if (i != (num_steps - 1)) { + cudaMemcpy(dev_image, dev_image2, pixelcount * sizeof(glm::vec3), cudaMemcpyDeviceToDevice); + } + } + } + cudaMemcpy(hst_scene->state.image.data(), dev_image2, + pixelcount * sizeof(glm::vec3), cudaMemcpyDeviceToHost); + } + else { + cudaMemcpy(hst_scene->state.image.data(), dev_image, + pixelcount * sizeof(glm::vec3), cudaMemcpyDeviceToHost); + } /////////////////////////////////////////////////////////////////////////// // CHECKITOUT: use dev_image as reference if you want to implement saving denoised images. // Otherwise, screenshots are also acceptable. // Retrieve image from GPU - cudaMemcpy(hst_scene->state.image.data(), dev_image, - pixelcount * sizeof(glm::vec3), cudaMemcpyDeviceToHost); checkCUDAError("pathtrace"); } @@ -431,3 +649,49 @@ const Camera &cam = hst_scene->state.camera; // Send results to OpenGL buffer for rendering sendImageToPBO<<>>(pbo, cam.resolution, iter, dev_image); } + +void setKernelOffset(float* dev_kernel, glm::ivec2* dev_offset) { + //set offset + //(-2, -2), (-2, -1), (-2, 0), .... 
+ int offsetCount = 0; + for (int i = -2; i < 3; ++i) { + for (int j = -2; j < 3; ++j) { + cudaMemcpy(dev_offset + offsetCount, &(glm::ivec2(i, j)), sizeof(glm::ivec2), cudaMemcpyHostToDevice); + ++offsetCount; + } + } + + std::vector kernelNominator = { 1.f, 4.f, 7.f, 4.f, 1.f, + 4.f, 16.f, 26.f, 16.f, 4.f, + 7.f, 26.f, 41.f, 26.f, 7.f, + 4.f, 16.f, 26.f, 16.f, 4.f, + 1.f, 4.f, 7.f, 4.f, 1.f }; + for (int i = 0; i < 25; ++i) { + kernelNominator[i] /= 273.f; + cudaMemcpy(dev_kernel + i, &kernelNominator[i], sizeof(float), cudaMemcpyHostToDevice); + } +} + +void setGaussianOffset(float* gaussian, glm::ivec2* gaussianOffset) { + + std::vector kernelNominator = { 0.f, 0.f, 1.f, 2.f, 1.f, 0.f, 0.f, + 0.f, 3.f, 13.f, 22.f, 13.f, 3.f, 0.f, + 1.f, 13.f, 59.f, 97.f, 59.f, 13.f, 1.f, + 2.f, 22.f, 97.f, 159.f, 97.f, 22.f, 2.f, + 1.f, 13.f, 59.f, 97.f, 59.f, 13.f, 1.f, + 0.f, 3.f, 13.f, 22.f, 13.f, 3.f, 0.f, + 0.f, 0.f, 1.f, 2.f, 1.f, 0.f, 0.f + }; + for (int i = 0; i < 49; ++i) { + kernelNominator[i] /= 1003.f; + cudaMemcpy(gaussian + i, &kernelNominator[i], sizeof(float), cudaMemcpyHostToDevice); + } + + int offsetCount = 0; + for (int i = -3; i < 4; ++i) { + for (int j = -3; j < 4; ++j) { + cudaMemcpy(gaussianOffset + offsetCount, &(glm::ivec2(i, j)), sizeof(glm::ivec2), cudaMemcpyHostToDevice); + ++offsetCount; + } + } +} \ No newline at end of file diff --git a/src/pathtrace.h b/src/pathtrace.h index 9e12f440..c1384e76 100644 --- a/src/pathtrace.h +++ b/src/pathtrace.h @@ -3,8 +3,10 @@ #include #include "scene.h" -void pathtraceInit(Scene *scene); -void pathtraceFree(); -void pathtrace(int frame, int iteration); +void pathtraceInit(Scene *scene, bool useZforPos, bool gaussian); +void pathtraceFree(bool useZforPos, bool gaussian); +void pathtrace(int frame, int iteration, int filterSeize, float colorWeight, float normalWeight, float positionWeight, bool denoiser, bool useZforPos, bool gaussian); void showGBuffer(uchar4 *pbo); void showImage(uchar4 *pbo, int iter); +void setKernelOffset(float* dev_kernel, glm::ivec2* dev_offset); +void setGaussianOffset(float* dev_gaussian, glm::ivec2* dev_offset); \ No newline at end of file diff --git a/src/preview.cpp b/src/preview.cpp index 3ca27180..e8364868 100644 --- a/src/preview.cpp +++ b/src/preview.cpp @@ -212,17 +212,18 @@ void drawGui(int windowWidth, int windowHeight) { ImGui::SliderInt("Iterations", &ui_iterations, 1, startupIterations); - ImGui::Checkbox("Denoise", &ui_denoise); - ImGui::SliderInt("Filter Size", &ui_filterSize, 0, 100); - ImGui::SliderFloat("Color Weight", &ui_colorWeight, 0.0f, 10.0f); - ImGui::SliderFloat("Normal Weight", &ui_normalWeight, 0.0f, 10.0f); - ImGui::SliderFloat("Position Weight", &ui_positionWeight, 0.0f, 10.0f); + ImGui::SliderFloat("Color Weight", &ui_colorWeight, 0.0f, 100.f); + ImGui::SliderFloat("Normal Weight", &ui_normalWeight, 0.0f, 2.f); + ImGui::SliderFloat("Position Weight", &ui_positionWeight, 0.0f, 5.f); ImGui::Separator(); ImGui::Checkbox("Show GBuffer", &ui_showGbuffer); - + ImGui::Checkbox("Denoiser", &ui_denoiser); + ImGui::Checkbox("Using Z for position", &ui_useZforPos); + ImGui::Checkbox("Gaussian", &ui_gaussian); + ImGui::Text("Runtime = %f", ui_timer); ImGui::Separator(); if (ImGui::Button("Save image and exit")) { diff --git a/src/sceneStructs.h b/src/sceneStructs.h index da7e558a..5ec88124 100644 --- a/src/sceneStructs.h +++ b/src/sceneStructs.h @@ -79,4 +79,11 @@ struct ShadeableIntersection { // What information might be helpful for guiding a denoising filter? 
struct GBufferPixel { float t; + glm::vec3 normal; + glm::vec3 position; }; + +struct GBufferPixelZ { + float z; + glm::vec3 normal; +}; \ No newline at end of file