diff --git a/README.md b/README.md index a82ea0f..e96271a 100644 --- a/README.md +++ b/README.md @@ -3,211 +3,116 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) - -### (TODO: Your README) - -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) - -Instructions (delete me) -======================== - -This is due Sunday, September 13 at midnight. - -**Summary:** In this project, you'll implement GPU stream compaction in CUDA, -from scratch. This algorithm is widely used, and will be important for -accelerating your path tracer project. - -Your stream compaction implementations in this project will simply remove `0`s -from an array of `int`s. In the path tracer, you will remove terminated paths -from an array of rays. - -In addition to being useful for your path tracer, this project is meant to -reorient your algorithmic thinking to the way of the GPU. On GPUs, many -algorithms can benefit from massive parallelism and, in particular, data -parallelism: executing the same code many times simultaneously with different -data. - -You'll implement a few different versions of the *Scan* (*Prefix Sum*) -algorithm. First, you'll implement a CPU version of the algorithm to reinforce -your understanding. Then, you'll write a few GPU implementations: "naive" and -"work-efficient." Finally, you'll use some of these to implement GPU stream -compaction. - -**Algorithm overview & details:** There are two primary references for details -on the implementation of scan and stream compaction. - -* The [slides on Parallel Algorithms](https://github.com/CIS565-Fall-2015/cis565-fall-2015.github.io/raw/master/lectures/2-Parallel-Algorithms.pptx) - for Scan, Stream Compaction, and Work-Efficient Parallel Scan. 
-* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html). - -Your GPU stream compaction implementation will live inside of the -`stream_compaction` subproject. This way, you will be able to easily copy it -over for use in your GPU path tracer. - - -## Part 0: The Usual - -This project (and all other CUDA projects in this course) requires an NVIDIA -graphics card with CUDA capability. Any card with Compute Capability 2.0 -(`sm_20`) or greater will work. Check your GPU on this -[compatibility table](https://developer.nvidia.com/cuda-gpus). -If you do not have a personal machine with these specs, you may use those -computers in the Moore 100B/C which have supported GPUs. - -**HOWEVER**: If you need to use the lab computer for your development, you will -not presently be able to do GPU performance profiling. This will be very -important for debugging performance bottlenecks in your program. - -### Useful existing code - -* `stream_compaction/common.h` - * `checkCUDAError` macro: checks for CUDA errors and exits if there were any. - * `ilog2ceil(x)`: computes the ceiling of log2(x), as an integer. -* `main.cpp` - * Some testing code for your implementations. - - -## Part 1: CPU Scan & Stream Compaction - -This stream compaction method will remove `0`s from an array of `int`s. - -In `stream_compaction/cpu.cu`, implement: - -* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum. -* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using - the `scan` function. -* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan` - function. Map the input array to an array of 0s and 1s, scan it, and use - scatter to produce the output. You will need a **CPU** scatter implementation - for this (see slides or GPU Gems chapter for an explanation). - -These implementations should only be a few lines long. 
- - -## Part 2: Naive GPU Scan Algorithm - -In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan` - -This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet -taught shared memory, and you **shouldn't use it yet**. Example 39-1 uses -shared memory, but is limited to operating on very small arrays! Instead, write -this using global memory only. As a result of this, you will have to do -`ilog2ceil(n)` separate kernel invocations. - -Beware of errors in Example 39-1 in the book; both the pseudocode and the CUDA -code in the online version of Chapter 39 are known to have a few small errors -(in superscripting, missing braces, bad indentation, etc.) - -Since the parallel scan algorithm operates on a binary tree structure, it works -best with arrays with power-of-two length. Make sure your implementation works -on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory -- your intermediate array sizes will need to be rounded to the next power of -two. - - -## Part 3: Work-Efficient GPU Scan & Stream Compaction - -### 3.1. Scan - -In `stream_compaction/efficient.cu`, implement -`StreamCompaction::Efficient::scan` - -All of the text in Part 2 applies. - -* This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2. -* Beware of errors in Example 39-2. -* Test non-power-of-two sized arrays. - -### 3.2. Stream Compaction - -This stream compaction method will remove `0`s from an array of `int`s. - -In `stream_compaction/efficient.cu`, implement -`StreamCompaction::Efficient::compact` - -For compaction, you will also need to implement the scatter algorithm presented -in the slides and the GPU Gems chapter. 
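The up-sweep/down-sweep structure of the work-efficient scan described above can be illustrated with a short host-side C++ sketch: a sequential simulation of the parallel tree passes on a power-of-two-sized array (`workEfficientScan` is my own name for the helper, not part of the assignment code; on the GPU each inner loop body would be one thread):

```cpp
#include <cassert>
#include <vector>

// In-place exclusive scan on a power-of-two-sized array, mirroring the
// up-sweep (reduce) and down-sweep phases of the work-efficient GPU scan.
void workEfficientScan(std::vector<int>& data) {
    int n = (int)data.size();             // assumed to be a power of two
    for (int d = 1; d < n; d <<= 1)       // up-sweep: build partial sums
        for (int k = 0; k < n; k += 2 * d)
            data[k + 2 * d - 1] += data[k + d - 1];
    data[n - 1] = 0;                      // clear the root
    for (int d = n / 2; d >= 1; d >>= 1)  // down-sweep: push sums back down
        for (int k = 0; k < n; k += 2 * d) {
            int t = data[k + d - 1];
            data[k + d - 1] = data[k + 2 * d - 1];
            data[k + 2 * d - 1] += t;
        }
}
```

Non-power-of-two inputs are handled exactly as the text says: pad the array with zeros up to the next power of two, run the scan, and read back only the first `n` results.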
- -In `stream_compaction/common.cu`, implement these for use in `compact`: - -* `StreamCompaction::Common::kernMapToBoolean` -* `StreamCompaction::Common::kernScatter` - - -## Part 4: Using Thrust's Implementation - -In `stream_compaction/thrust.cu`, implement: - -* `StreamCompaction::Thrust::scan` - -This should be a very short function which wraps a call to the Thrust library -function `thrust::exclusive_scan(first, last, result)`. - -To measure timing, be sure to exclude memory operations by passing -`exclusive_scan` a `thrust::device_vector` (which is already allocated on the -GPU). You can create a `thrust::device_vector` by creating a -`thrust::host_vector` from the given pointer, then casting it. - - -## Part 5: Radix Sort (Extra Credit) (+10) - -Add an additional module to the `stream_compaction` subproject. Implement radix -sort using one of your scan implementations. Add tests to check its correctness. - - -## Write-up - -1. Update all of the TODOs at the top of this README. -2. Add a description of this project including a list of its features. -3. Add your performance analysis (see below). - -All extra credit features must be documented in your README, explaining its -value (with performance comparison, if applicable!) and showing an example how -it works. For radix sort, show how it is called and an example of its output. - -Always profile with Release mode builds and run without debugging. - -### Questions - -* Roughly optimize the block sizes of each of your implementations for minimal - run time on your GPU. - * (You shouldn't compare unoptimized implementations to each other!) - -* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and - Thrust) to the serial CPU version of Scan. Plot a graph of the comparison - (with array size on the independent axis). - * You should use CUDA events for timing. Be sure **not** to include any - explicit memory operations in your performance measurements, for - comparability. 
- * To guess at what might be happening inside the Thrust implementation, take - a look at the Nsight timeline for its execution. - -* Write a brief explanation of the phenomena you see here. - * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is - it different for each implementation? - -* Paste the output of the test program into a triple-backtick block in your - README. - * If you add your own tests (e.g. for radix sort or to test additional corner - cases), be sure to mention it explicitly. - -These questions should help guide you in performance analysis on future -assignments, as well. - -## Submit - -If you have modified any of the `CMakeLists.txt` files at all (aside from the -list of `SOURCE_FILES`), you must test that your project can build in Moore -100B/C. Beware of any build issues discussed on the Google Group. - -1. Open a GitHub pull request so that we can see that you have finished. - The title should be "Submission: YOUR NAME". -2. Send an email to the TA (gmail: kainino1+cis565@) with: - * **Subject**: in the form of `[CIS565] Project 2: PENNKEY` - * Direct link to your pull request on GitHub - * In the form of a grade (0-100+) with comments, evaluate your own - performance on the project. - * Feedback on the project itself, if any. 
+* Ziwei Zong
+* Tested on: Windows 10, i7-5500 @ 2.40GHz 8GB, GTX 950M (Personal)
+
+Descriptions
+--------------------------
+ * Part 1 : CPU Scan & Compaction (cpu.cu)
+ * Part 2 : Naive Scan (naive.cu)
+ * Part 3 : Work-Efficient Scan & Compaction (efficient.cu)
+ * Part 4 : Thrust Scan (thrust.cu)
+ * Part 5 : Radix Sort (thrust.cu, `RadixSort::sort`)
+
+Block Size Optimization
+--------------------------
+
+|            |  32 |  64 | 128 | 256 | 512 | 1024|
+|------------|-----|-----|-----|-----|-----|-----|
+|naive       |0.062|0.061|0.060|0.062|0.064|0.078|
+|efficient   |0.139|0.139|0.140|0.142|0.148|0.155|
+|thrust      |1.060|1.180|1.200|1.100|1.029|1.090|
+
+(all times in ms)
+
+Based on these numbers, I chose a block size of 128 for the naive scan, 64 for the efficient scan, and 512 for the thrust scan.
+
+CPU & GPU Performance Comparison
+--------------------------------
+
+|Method        |Array length   | 2^4   | 2^6     | 2^8     | 2^10    | 2^12    |
+|--------------|---------------|-------|---------|---------|---------|---------|
+|cpu           |power of 2     |0.0024 |0.002304 |0.002304 |0.002304 |0.002304 |
+|cpu           |non-power of 2 |0.0023 |0.002304 |0.002304 |0.002336 |0.002336 |
+|gpu naive     |power of 2     |0.027  |0.0374   |0.0496   |0.0609   |0.0766   |
+|gpu naive     |non-power of 2 |0.0252 |0.0355   |0.0473   |0.0587   |0.0745   |
+|gpu efficient |power of 2     |0.0541 |0.0765   |0.0985   |0.1312   |0.1551   |
+|gpu efficient |non-power of 2 |0.053  |0.0748   |0.0969   |0.1307   |0.1537   |
+|gpu thrust    |power of 2     |0.03293|0.3129   |0.3414   |0.3854   |0.3011   |
+|gpu thrust    |non-power of 2 |0.3039 |0.3476   |0.2955   |0.3535   |0.3035   |
+
+(all times in ms)
+
+The graphs are shown below.
+
+![](images/power%20of%202.png)
+![](images/not%20power%20of%202.png)
+
+- As the line charts show, the thrust scan time does not follow any clear pattern, while the naive and efficient scan times grow roughly as a power function of the array length.
+- Strangely, the CPU time stays very small no matter how long the array is. I suspected that timing CPU code with a CUDA event was inaccurate, so I also tried std::clock.
However, the reported time was even smaller (0.0). So the CPU really did outperform the GPU on this task, which is probably due to the high cost of GPU global-memory access for such small workloads.
+- The naive scan also runs faster than the work-efficient scan, which is not what I expected.
+
+
+Output
+--------------------------
+
+****************
+** SCAN TESTS **
+****************
+
+    [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 26 0 ]
+==== cpu scan, power-of-two ====
+StreamCompaction::CPU::scan : exclusive prefix sum.
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6203 6229 ]
+==== cpu scan, non-power-of-two ====
+StreamCompaction::CPU::scan : exclusive prefix sum.
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6146 6190 ]
+    passed
+==== naive scan, power-of-two ====
+    GPU time for naive scan : 0.0696ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6203 6229 ]
+    passed
+==== naive scan, non-power-of-two ====
+    GPU time for naive scan : 0.0676ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6146 6190 ]
+    passed
+==== work-efficient scan, power-of-two ====
+    [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 26 0 ]
+    GPU time for efficient scan : 0.1403ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6203 6229 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+    [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 44 8 ]
+    GPU time for efficient scan : 0.1403ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6146 6190 ]
+    passed
+==== thrust scan, power-of-two ====
+    GPU time for thrust scan : 128.8397ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6203 6229 ]
+    passed
+==== thrust scan, non-power-of-two ====
+    GPU time for thrust scan : 1.1305ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6146 6190 ]
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ]
+==== cpu compact without scan, power-of-two ====
+    [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 
2 1 ] + passed + ==== cpu compact without scan, non-power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 3 2 ] + passed + ==== cpu compact with scan ==== + StreamCompaction::CPU::scan : exclusive prefix sum. + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 1 ] + passed + ==== work-efficient compact, power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ] + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 1 ] + passed + ==== work-efficient compact, non-power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ] + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 3 2 ] + passed + +***************************** +** Radix Sort ** +***************************** + ==== Radix Sort, power-of-two ==== + [ 4 7 2 6 3 5 1 0 ] + [ 4 2 6 0 7 3 5 1 ] diff --git a/images/not power of 2.png b/images/not power of 2.png new file mode 100644 index 0000000..6297623 Binary files /dev/null and b/images/not power of 2.png differ diff --git a/images/power of 2.png b/images/power of 2.png new file mode 100644 index 0000000..050a6f4 Binary files /dev/null and b/images/power of 2.png differ diff --git a/src/main.cpp b/src/main.cpp index 7308451..9834fc7 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -12,33 +12,35 @@ #include #include #include "testing_helpers.hpp" +#include +#include int main(int argc, char* argv[]) { - const int SIZE = 1 << 8; + const int SIZE = 1 << 10; const int NPOT = SIZE - 3; int a[SIZE], b[SIZE], c[SIZE]; // Scan tests - + printf("\n"); printf("****************\n"); printf("** SCAN TESTS **\n"); printf("****************\n"); genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; printArray(SIZE, a, true); - zeroArray(SIZE, b); printDesc("cpu scan, power-of-two"); StreamCompaction::CPU::scan(SIZE, b, a); - printArray(SIZE, b, true); + //printArray(SIZE, b, true); zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); StreamCompaction::CPU::scan(NPOT, c, a); - printArray(NPOT, b, true); + //printArray(NPOT, b, true); printCmpResult(NPOT, b, c); - + zeroArray(SIZE, c); 
printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); @@ -48,15 +50,14 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); - //printArray(SIZE, c, true); + //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); - + zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); //printArray(SIZE, c, true); printCmpResult(SIZE, b, c); - zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); @@ -74,7 +75,7 @@ int main(int argc, char* argv[]) { StreamCompaction::Thrust::scan(NPOT, c, a); //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); - + //* printf("\n"); printf("*****************************\n"); printf("** STREAM COMPACTION TESTS **\n"); @@ -84,9 +85,8 @@ int main(int argc, char* argv[]) { genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case printArray(SIZE, a, true); - + a[SIZE - 1] = 0; int count, expectedCount, expectedNPOT; - zeroArray(SIZE, b); printDesc("cpu compact without scan, power-of-two"); count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); @@ -109,13 +109,37 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("work-efficient compact, power-of-two"); + printArray(SIZE, a, true);//delete count = StreamCompaction::Efficient::compact(SIZE, c, a); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); + printArray(SIZE, a, true);//delete count = StreamCompaction::Efficient::compact(NPOT, c, a); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + + printf("\n"); + printf("*****************************\n"); + printf("** Radix Sort **\n"); + printf("*****************************\n"); + + 
a[0] = 4; + a[1] = 7; + a[2] = 2; + a[3] = 6; + a[4] = 3; + a[5] = 5; + a[6] = 1; + a[7] = 0; + + zeroArray(8, c); + printDesc("Radix Sort, power-of-two"); + printArray(8, a, true);//delete + StreamCompaction::RadixSort::sort(8, c, a); + printArray(8, c, true); + //printCmpResult(SIZE, b, c);*/ + std::cin.get(); } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 4f52663..698a919 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -7,6 +7,10 @@ #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +#define blockSize_eff 64 +#define blockSize_naive 128 +#define blockSize_thrust 512 + /** * Check for CUDA errors; print and exit if there was a problem. */ diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index e600c29..cc8e3a6 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,5 +1,8 @@ #include #include "cpu.h" +#include "common.h" +#include +#include namespace StreamCompaction { namespace CPU { @@ -8,8 +11,40 @@ namespace CPU { * CPU scan (prefix sum). 
 */
 void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+    cudaEvent_t start, stop;
+    cudaEventCreate(&start);
+    cudaEventCreate(&stop);
+
+    cudaEventRecord(start);
+    std::clock_t t1, t2;
+    t1 = std::clock();
+
+    double duration;
+
+    for (int i = 0; i < n; i++)
+    {
+        if (i == 0)
+        {
+            odata[i] = 0;
+            continue;
+        }
+        odata[i] = odata[i - 1] + idata[i - 1];
+    }
+    t2 = std::clock();
+
+    duration = (t2 - t1) * 1000.0 / CLOCKS_PER_SEC; // convert clock ticks to ms
+
+    cudaEventRecord(stop);
+
+    cudaEventSynchronize(stop);
+    float milliseconds = 0;
+    cudaEventElapsedTime(&milliseconds, start, stop);
+
+    printf("\t(CudaEvent)CPU time for scan : %.3fms\n", milliseconds);
+    printf("\t(Clock)CPU time for scan : %.3fms\n", duration);
+
+    // TO_DOne
+    //printf("StreamCompaction::CPU::scan : exclusive prefix sum.\n");
 }
 
 /**
@@ -18,8 +53,16 @@ void scan(int n, int *odata, const int *idata) {
  * @returns the number of elements remaining after compaction.
  */
 int compactWithoutScan(int n, int *odata, const int *idata) {
-    // TODO
-    return -1;
+    int k = 0;
+    for (int i = 0; i < n; i++)
+    {
+        if (idata[i] != 0)
+        {
+            odata[k++] = idata[i];
+        }
+    }
+    // TO_DOne
+    return k;
 }
 
 /**
@@ -28,8 +71,36 @@ int compactWithoutScan(int n, int *odata, const int *idata) {
  * @returns the number of elements remaining after compaction.
*/ int compactWithScan(int n, int *odata, const int *idata) { - // TODO - return -1; + // TO_DOne + int*TempArray = new int[n+1]; + int*ScanArray = new int[n+1]; + for (int i = 0; i < n; i++) + { + if (idata[i] != 0) + { + TempArray[i] = 1; + } + else + { + TempArray[i] = 0; + } + } + TempArray[n] = 0; + scan(n+1, ScanArray, TempArray); + + int k = 0; + for (int i = 0; i < n; i++) + { + if (TempArray[i]==1) + { + odata[ScanArray[i]] = idata[i]; + k++; + } + } + int count = ScanArray[n]; + delete[] TempArray; + delete[] ScanArray; + return count; } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index b2f739b..81aa40e 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -4,31 +4,190 @@ #include "efficient.h" namespace StreamCompaction { -namespace Efficient { - -// TODO: __global__ - -/** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); -} - -/** - * Performs stream compaction on idata, storing the result into odata. - * All zeroes are discarded. - * - * @param n The number of elements in idata. - * @param odata The array into which to store elements. - * @param idata The array of elements to compact. - * @returns The number of elements remaining after compaction. - */ -int compact(int n, int *odata, const int *idata) { - // TODO - return -1; -} - -} -} + namespace Efficient { + + // TODO: __global__ + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
+         */
+        int * dev_o;
+
+        __global__ void kernUpSweep(int pow2d, int * dev_odata, int end)
+        {
+            int k = blockIdx.x * blockDim.x + threadIdx.x;
+            if (k > end) return;
+            dev_odata[k * 2 * pow2d + pow2d * 2 - 1] += dev_odata[k * 2 * pow2d + pow2d - 1];
+        }
+
+        __global__ void kernDownSweep(int pow2d, int * dev_odata, int n, int end)
+        {
+            int k = (blockIdx.x * blockDim.x + threadIdx.x);
+            if (k > end) return;
+            k = k * 2 * pow2d;
+            int t = dev_odata[k + pow2d - 1];
+            dev_odata[k + pow2d - 1] = dev_odata[k + pow2d * 2 - 1];
+            dev_odata[k + pow2d * 2 - 1] += t;
+        }
+
+        __global__ void setRootZero(int * dev_odata, int n)
+        {
+            dev_odata[n - 1] = 0;
+        }
+
+        void initArrays(int n, const int *hst_idata)
+        {
+            int size = n*sizeof(int);
+
+            cudaMalloc((void**)&dev_o, size);
+            //checkCUDAError("cudaMalloc dev_o failed");
+            cudaMemcpy(dev_o, hst_idata, size, cudaMemcpyHostToDevice);
+            //checkCUDAError("cudaMemcpy odata->dev_o failed");
+        }
+
+        void freeArrays()
+        {
+            cudaFree(dev_o);
+        }
+
+        void scan(int n, int *odata, const int *idata) {
+            // TO_DOne
+            int N = ilog2ceil(n);
+            N = pow(2, N);
+
+            dim3 threadsPerBlock(blockSize_eff);
+
+            // Pad the device array to N with zeros; copying N elements
+            // straight from idata would read past the end of the input.
+            cudaMalloc((void**)&dev_o, N*sizeof(int));
+            cudaMemset(dev_o, 0, N*sizeof(int));
+            cudaMemcpy(dev_o, idata, n*sizeof(int), cudaMemcpyHostToDevice);
+
+            cudaEvent_t start, stop;
+            cudaEventCreate(&start);
+            cudaEventCreate(&stop);
+
+            cudaEventRecord(start);
+            for (int d = 0; d <= ilog2ceil(N) - 1; d++)
+            {
+                int pow2d = pow(2, d);
+                int end = (N - 1) / (2 * pow2d) + 1;
+                dim3 fullBlocksPerGrid((end + blockSize_eff - 1) / blockSize_eff);
+                kernUpSweep <<<fullBlocksPerGrid, threadsPerBlock>>>(pow2d, dev_o, end);
+            }
+            setRootZero<<<1, 1>>>(dev_o, N);
+            for (int d = ilog2ceil(N) - 1; d >= 0; d--)
+            {
+                int pow2d = pow(2, d);
+                int end = (N - 1) / (2 * pow2d) + 1;
+                dim3 fullBlocksPerGrid((end + blockSize_eff - 1) / blockSize_eff);
+                kernDownSweep <<<fullBlocksPerGrid, threadsPerBlock>>>(pow2d, dev_o, N, end);
+            }
+            cudaEventRecord(stop);
+            cudaEventSynchronize(stop);
+            float milliseconds = 0;
+            cudaEventElapsedTime(&milliseconds, start, stop);
+
+            cudaMemcpy(odata, dev_o, n*sizeof(int), cudaMemcpyDeviceToHost);
+
+            printf("\t GPU time for efficient scan : %.4fms\n", milliseconds);
+
+            freeArrays();
+        }
+
+        /**
+         * Performs stream compaction on idata, storing the result into odata.
+         * All zeroes are discarded.
+         *
+         * @param n The number of elements in idata.
+         * @param odata The array into which to store elements.
+         * @param idata The array of elements to compact.
+         * @returns The number of elements remaining after compaction.
+         */
+        int * dev_temp;
+        int * dev_scan;
+        int * dev_compactOut;
+        int * dev_input;
+
+        void freeCompArrays()
+        {
+            cudaFree(dev_temp);
+            cudaFree(dev_scan);
+            cudaFree(dev_compactOut);
+            cudaFree(dev_input);
+        }
+
+        __global__ void kernMapToBoolean(int * dev_idata, int *dev_outTemp, int n)
+        {
+            int index = threadIdx.x;
+            if (dev_idata[index] != 0 && index < n)
+                dev_outTemp[index] = 1;
+            else
+                dev_outTemp[index] = 0;
+        }
+
+        __global__ void kernScatter(int *dev_temp, int *dev_scan, int *dev_idata, int *dev_odata)
+        {
+            int index = threadIdx.x;
+            if (dev_temp[index] == 1)
+                dev_odata[dev_scan[index]] = dev_idata[index];
+        }
+
+        int compact(int n, int *odata, const int *idata) {
+            int N = pow(2, ilog2ceil(n));
+            int size = N*sizeof(int);
+
+            cudaMalloc((void**)&dev_input, size);
+            cudaMemset(dev_input, 0, size); // zero-pad up to the next power of two
+            cudaMemcpy(dev_input, idata, n*sizeof(int), cudaMemcpyHostToDevice);
+            cudaMalloc((void**)&dev_temp, size);
+            cudaMalloc((void**)&dev_scan, size);
+
+            //Step 1 : Map to boolean
+            kernMapToBoolean << <1, N >> >(dev_input, dev_temp, n);//later:blocksize,gridsize
+            cudaMemcpy(dev_scan, dev_temp, size, cudaMemcpyDeviceToDevice);
+
+            //Step 2 : Run exclusive scan
+            for (int d = 0; d <= ilog2ceil(N) - 1; d++)
+            {
+                int pow2d = pow(2, d);
+                int end = (N - 1) / (2 * pow2d) + 1;
+                kernUpSweep << <1, end >> >(pow2d, dev_scan, end);//later:blocksize,gridsize
+            }
+
+            setRootZero << <1, 1 >> >(dev_scan, N);
+
+            for (int d = ilog2ceil(N) - 1; d >= 0; d--)
+            {
+                int pow2d = pow(2, d);
+                int end = (N - 1) / (2 * pow2d) + 1;
+                kernDownSweep << <1, end >> >(pow2d, dev_scan, N, end);
+            }
+
+            //Step 3 : Scatter
+            int compactLength;
+            cudaMemcpy(&compactLength, &(dev_scan[N - 1]), sizeof(int), cudaMemcpyDeviceToHost);
+            cudaMalloc((void**)&dev_compactOut, compactLength*sizeof(int));
+
+            kernScatter << <1, N >> >(dev_temp, dev_scan, dev_input, dev_compactOut);//later:blocksize,gridsize
+
+            cudaMemcpy(odata, dev_compactOut, compactLength*sizeof(int), cudaMemcpyDeviceToHost);
+            // TO_DOne
+            freeCompArrays();
+            return
compactLength; + } + + } +} \ No newline at end of file diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 3d86b60..3ee5846 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -4,17 +4,69 @@ #include "naive.h" namespace StreamCompaction { -namespace Naive { + namespace Naive { -// TODO: __global__ + // TODO: __global__ -/** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); -} + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + */ + int * dev_o; -} + __global__ void kernNaiveScan(int pow2d_1, int *dev_odata) + { + int k = blockIdx.x * blockDim.x + threadIdx.x; + if (k >= pow2d_1) + dev_odata[k] = dev_odata[k - pow2d_1] + dev_odata[k]; + } + + void initArrays(int n, const int *hst_idata) + { + int size = n*sizeof(int); + + cudaMalloc((void**)&dev_o, size); + //checkCUDAError("cudaMalloc dev_o failed"); + cudaMemcpy(dev_o, hst_idata, size, cudaMemcpyHostToDevice); + //checkCUDAError("cudaMemcpy odata->dev_o failed"); + + } + + void freeArrays() + { + cudaFree(dev_o); + } + + void scan(int n, int *odata, const int *idata) { + dim3 threadsPerBlock(blockSize_naive); + dim3 fullBlocksPerGrid((n + blockSize_naive - 1) / blockSize_naive); + //??? inclusive or exclusive ? 
not exactly 39.2/slides
+            initArrays(n, idata);
+
+            cudaEvent_t start, stop;
+            cudaEventCreate(&start);
+            cudaEventCreate(&stop);
+
+            cudaEventRecord(start);
+            for (int d = 1; d <= ilog2ceil(n); d++)
+            {
+                int pow2_d_1 = pow(2, d - 1);
+                kernNaiveScan <<<fullBlocksPerGrid, threadsPerBlock>>>(pow2_d_1, dev_o);
+            }
+            cudaEventRecord(stop);
+            cudaEventSynchronize(stop);
+            float milliseconds = 0;
+            cudaEventElapsedTime(&milliseconds, start, stop);
+            printf("\t GPU time for naive scan : %.4fms\n", milliseconds);
+            // TO_DOne
+            cudaMemcpy(odata, dev_o, n*sizeof(int), cudaMemcpyDeviceToHost);
+            //inclusive to exclusive
+            for (int i = n - 1; i > 0; i--)
+            {
+                odata[i] = odata[i - 1];
+            }
+            odata[0] = 0;
+            freeArrays();
+        }
+
+    }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index d8dbb32..ba09626 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -7,16 +7,143 @@
 #include "thrust.h"
 namespace StreamCompaction {
-namespace Thrust {
-
-/**
- * Performs prefix-sum (aka scan) on idata, storing the result into odata.
- */
-void scan(int n, int *odata, const int *idata) {
-    // TODO use `thrust::exclusive_scan`
-    // example: for device_vectors dv_in and dv_out:
-    // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
-}
-
-}
-}
+    namespace Thrust {
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+            // TO_DOne use `thrust::exclusive_scan`
+            // example: for device_vectors dv_in and dv_out:
+            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+            //thrust::host_vector<int> hst_in(n);
+            //hst_in.resize(n);
+
+            thrust::device_vector<int> dv_in(idata, idata + n);
+            thrust::device_vector<int> dv_out = dv_in;
+
+            //thrust::copy_n(idata, n, dv_in);
+            cudaEvent_t start, stop;
+            cudaEventCreate(&start);
+            cudaEventCreate(&stop);
+
+            cudaEventRecord(start);
+            thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            cudaEventRecord(stop);
+
+            cudaEventSynchronize(stop);
+            float milliseconds = 0;
+            cudaEventElapsedTime(&milliseconds, start, stop);
+            printf("\t GPU time for thrust scan : %.4fms\n", milliseconds);
+            thrust::copy(dv_out.begin(), dv_out.end(), odata);
+        }
+
+    }
+    namespace RadixSort
+    {
+        __global__ void kernCalc_e(
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dev_iArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dev_eArray,
+            int n
+            )
+        {
+            int index = blockIdx.x*blockDim.x + threadIdx.x;
+            if (index >= n) return;
+            int LSB = dev_iArray[index] % 2;
+            dev_eArray[index] = (LSB == 1) ? 0 : 1;
+        }
+        __global__ void kernCalc_t(
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_fArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_tArray,
+            int totalFalses
+            )
+        {
+            int index = threadIdx.x;
+
+            dv_tArray[index] = index - dv_fArray[index] + totalFalses;
+        }
+        __global__ void kernCalc_d(
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_fArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_tArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_eArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_dArray
+            )
+        {
+            int index = threadIdx.x;
+            bool bi = (dv_eArray[index] == 1) ? false : true;
+            dv_dArray[index] = bi ?
dv_tArray[index] : dv_fArray[index];
+        }
+
+        __global__ void kernScatterBasedOn_d(
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_iArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_oArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_dArray
+            )
+        {
+            int index = threadIdx.x;
+            dv_oArray[dv_dArray[index]] = dv_iArray[index];
+        }
+
+        void sort(int n, int *odata, const int *idata)
+        {
+            //initArrays_RS(n, idata);
+            thrust::device_vector<int> dv_i(idata, idata + n);
+            thrust::device_vector<int> dv_e = dv_i;
+            thrust::device_vector<int> dv_f = dv_e;
+            thrust::device_vector<int> dv_t = dv_f;
+            thrust::device_vector<int> dv_d = dv_f;
+            thrust::device_vector<int> dv_o = dv_d;
+
+            dim3 fullBlocksPerGrid((n + blockSize_thrust - 1) / blockSize_thrust);
+            dim3 threadsPerBlock(blockSize_thrust);
+
+            // Step 1 : Compute e array
+            kernCalc_e <<<fullBlocksPerGrid, threadsPerBlock>>>(dv_i.begin(), dv_e.begin(), n);
+            /*
+            thrust::copy_n(dv_e.begin(), n, odata);
+            printf("\n****** step 1 : dv_e\n----[");
+            for (int i = 0; i < n; i++)
+            {
+                printf(" %3d", odata[i]);
+            }
+            printf("]\n"); //*/
+            // Step 2 : Exclusive Scan e
+            thrust::exclusive_scan(dv_e.begin(), dv_e.end(), dv_f.begin());
+            /*
+            thrust::copy_n(dv_f.begin(), n, odata);
+            printf("\n****** step 2 : dv_f\n----[");
+            for (int i = 0; i < n; i++)
+            {
+                printf(" %3d", odata[i]);
+            }
+            printf("]\n"); //*/
+
+            //Step 3 : Compute totalFalses
+            int totalFalses = dv_e[n - 1] + dv_f[n - 1];
+
+            //Step 4 : Compute t array
+            kernCalc_t << <1, n >> >(dv_f.begin(), dv_t.begin(), totalFalses);
+
+            //Step 5 : Compute d array and scatter based on d
+            kernCalc_d << <1, n >> >(dv_f.begin(), dv_t.begin(), dv_e.begin(), dv_d.begin());
+
+            kernScatterBasedOn_d << <1, n >> >(dv_i.begin(), dv_o.begin(), dv_d.begin());
+
+            thrust::copy_n(dv_o.begin(), n, odata);
+            /*printf("\n****** final : dv_o\n----[");
+            for (int i = 0; i < n; i++)
+            {
+                printf(" %3d", odata[i]);
+            }
+            printf("]\n"); */
+        }
+    }
+}
\ No newline at end of file
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
index 06707f3..372d4e3 100644
--- 
a/stream_compaction/thrust.h +++ b/stream_compaction/thrust.h @@ -1,7 +1,11 @@ #pragma once namespace StreamCompaction { -namespace Thrust { - void scan(int n, int *odata, const int *idata); -} + namespace Thrust { + void scan(int n, int *odata, const int *idata); + } + namespace RadixSort + { + void sort(int n, int *odata, const int *idata); + } }
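For reference, the split step that the `RadixSort` kernels above implement (the `e`/`f`/`t`/`d` arrays from the Parallel Algorithms slides) can be sketched on the host. Note that `sort` above only examines the least-significant bit (`dev_iArray[index] % 2`), so its output `[ 4 2 6 0 7 3 5 1 ]` is a single split pass rather than a fully sorted array; looping the pass over all bits gives a full LSB radix sort. A minimal host-side C++ sketch (function names `splitByBit` and `radixSort` are mine):

```cpp
#include <cassert>
#include <vector>

// One split pass of LSB radix sort: stable-partition by the given bit,
// using the e (bit is 0), f (scan of e), and t/d (destination) arrays.
std::vector<int> splitByBit(const std::vector<int>& in, int bit) {
    int n = (int)in.size();
    std::vector<int> e(n), f(n, 0), out(n);
    for (int i = 0; i < n; ++i)
        e[i] = ((in[i] >> bit) & 1) ? 0 : 1;   // 1 where the bit is 0
    for (int i = 1; i < n; ++i)
        f[i] = f[i - 1] + e[i - 1];            // exclusive scan of e
    int totalFalses = e[n - 1] + f[n - 1];
    for (int i = 0; i < n; ++i) {
        int d = e[i] ? f[i] : i - f[i] + totalFalses;  // destination index
        out[d] = in[i];
    }
    return out;
}

// Full LSB radix sort: apply the split pass bit by bit.
std::vector<int> radixSort(std::vector<int> v, int bits) {
    for (int b = 0; b < bits; ++b)
        v = splitByBit(v, b);
    return v;
}
```

On the example input, one pass on bit 0 reproduces the README output above, and three passes fully sort the 3-bit values.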