diff --git a/README.md b/README.md index a82ea0f..e96271a 100644 --- a/README.md +++ b/README.md @@ -3,211 +3,116 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) - -### (TODO: Your README) - -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) - -Instructions (delete me) -======================== - -This is due Sunday, September 13 at midnight. - -**Summary:** In this project, you'll implement GPU stream compaction in CUDA, -from scratch. This algorithm is widely used, and will be important for -accelerating your path tracer project. - -Your stream compaction implementations in this project will simply remove `0`s -from an array of `int`s. In the path tracer, you will remove terminated paths -from an array of rays. - -In addition to being useful for your path tracer, this project is meant to -reorient your algorithmic thinking to the way of the GPU. On GPUs, many -algorithms can benefit from massive parallelism and, in particular, data -parallelism: executing the same code many times simultaneously with different -data. - -You'll implement a few different versions of the *Scan* (*Prefix Sum*) -algorithm. First, you'll implement a CPU version of the algorithm to reinforce -your understanding. Then, you'll write a few GPU implementations: "naive" and -"work-efficient." Finally, you'll use some of these to implement GPU stream -compaction. - -**Algorithm overview & details:** There are two primary references for details -on the implementation of scan and stream compaction. - -* The [slides on Parallel Algorithms](https://github.com/CIS565-Fall-2015/cis565-fall-2015.github.io/raw/master/lectures/2-Parallel-Algorithms.pptx) - for Scan, Stream Compaction, and Work-Efficient Parallel Scan. 
-* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html). - -Your GPU stream compaction implementation will live inside of the -`stream_compaction` subproject. This way, you will be able to easily copy it -over for use in your GPU path tracer. - - -## Part 0: The Usual - -This project (and all other CUDA projects in this course) requires an NVIDIA -graphics card with CUDA capability. Any card with Compute Capability 2.0 -(`sm_20`) or greater will work. Check your GPU on this -[compatibility table](https://developer.nvidia.com/cuda-gpus). -If you do not have a personal machine with these specs, you may use those -computers in the Moore 100B/C which have supported GPUs. - -**HOWEVER**: If you need to use the lab computer for your development, you will -not presently be able to do GPU performance profiling. This will be very -important for debugging performance bottlenecks in your program. - -### Useful existing code - -* `stream_compaction/common.h` - * `checkCUDAError` macro: checks for CUDA errors and exits if there were any. - * `ilog2ceil(x)`: computes the ceiling of log2(x), as an integer. -* `main.cpp` - * Some testing code for your implementations. - - -## Part 1: CPU Scan & Stream Compaction - -This stream compaction method will remove `0`s from an array of `int`s. - -In `stream_compaction/cpu.cu`, implement: - -* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum. -* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using - the `scan` function. -* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan` - function. Map the input array to an array of 0s and 1s, scan it, and use - scatter to produce the output. You will need a **CPU** scatter implementation - for this (see slides or GPU Gems chapter for an explanation). - -These implementations should only be a few lines long. 
- - -## Part 2: Naive GPU Scan Algorithm - -In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan` - -This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet -taught shared memory, and you **shouldn't use it yet**. Example 39-1 uses -shared memory, but is limited to operating on very small arrays! Instead, write -this using global memory only. As a result of this, you will have to do -`ilog2ceil(n)` separate kernel invocations. - -Beware of errors in Example 39-1 in the book; both the pseudocode and the CUDA -code in the online version of Chapter 39 are known to have a few small errors -(in superscripting, missing braces, bad indentation, etc.) - -Since the parallel scan algorithm operates on a binary tree structure, it works -best with arrays with power-of-two length. Make sure your implementation works -on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory -- your intermediate array sizes will need to be rounded to the next power of -two. - - -## Part 3: Work-Efficient GPU Scan & Stream Compaction - -### 3.1. Scan - -In `stream_compaction/efficient.cu`, implement -`StreamCompaction::Efficient::scan` - -All of the text in Part 2 applies. - -* This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2. -* Beware of errors in Example 39-2. -* Test non-power-of-two sized arrays. - -### 3.2. Stream Compaction - -This stream compaction method will remove `0`s from an array of `int`s. - -In `stream_compaction/efficient.cu`, implement -`StreamCompaction::Efficient::compact` - -For compaction, you will also need to implement the scatter algorithm presented -in the slides and the GPU Gems chapter. 
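The up-sweep/down-sweep structure of the work-efficient scan described above can be illustrated with a short host-side C++ sketch: a sequential simulation of the parallel tree passes on a power-of-two-sized array (`workEfficientScan` is my own name for the helper, not part of the assignment code; on the GPU each inner loop body would be one thread):

```cpp
#include <cassert>
#include <vector>

// In-place exclusive scan on a power-of-two-sized array, mirroring the
// up-sweep (reduce) and down-sweep phases of the work-efficient GPU scan.
void workEfficientScan(std::vector<int>& data) {
    int n = (int)data.size();             // assumed to be a power of two
    for (int d = 1; d < n; d <<= 1)       // up-sweep: build partial sums
        for (int k = 0; k < n; k += 2 * d)
            data[k + 2 * d - 1] += data[k + d - 1];
    data[n - 1] = 0;                      // clear the root
    for (int d = n / 2; d >= 1; d >>= 1)  // down-sweep: push sums back down
        for (int k = 0; k < n; k += 2 * d) {
            int t = data[k + d - 1];
            data[k + d - 1] = data[k + 2 * d - 1];
            data[k + 2 * d - 1] += t;
        }
}
```

Non-power-of-two inputs are handled exactly as the text says: pad the array with zeros up to the next power of two, run the scan, and read back only the first `n` results.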
- -In `stream_compaction/common.cu`, implement these for use in `compact`: - -* `StreamCompaction::Common::kernMapToBoolean` -* `StreamCompaction::Common::kernScatter` - - -## Part 4: Using Thrust's Implementation - -In `stream_compaction/thrust.cu`, implement: - -* `StreamCompaction::Thrust::scan` - -This should be a very short function which wraps a call to the Thrust library -function `thrust::exclusive_scan(first, last, result)`. - -To measure timing, be sure to exclude memory operations by passing -`exclusive_scan` a `thrust::device_vector` (which is already allocated on the -GPU). You can create a `thrust::device_vector` by creating a -`thrust::host_vector` from the given pointer, then casting it. - - -## Part 5: Radix Sort (Extra Credit) (+10) - -Add an additional module to the `stream_compaction` subproject. Implement radix -sort using one of your scan implementations. Add tests to check its correctness. - - -## Write-up - -1. Update all of the TODOs at the top of this README. -2. Add a description of this project including a list of its features. -3. Add your performance analysis (see below). - -All extra credit features must be documented in your README, explaining its -value (with performance comparison, if applicable!) and showing an example how -it works. For radix sort, show how it is called and an example of its output. - -Always profile with Release mode builds and run without debugging. - -### Questions - -* Roughly optimize the block sizes of each of your implementations for minimal - run time on your GPU. - * (You shouldn't compare unoptimized implementations to each other!) - -* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and - Thrust) to the serial CPU version of Scan. Plot a graph of the comparison - (with array size on the independent axis). - * You should use CUDA events for timing. Be sure **not** to include any - explicit memory operations in your performance measurements, for - comparability. 
- * To guess at what might be happening inside the Thrust implementation, take - a look at the Nsight timeline for its execution. - -* Write a brief explanation of the phenomena you see here. - * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is - it different for each implementation? - -* Paste the output of the test program into a triple-backtick block in your - README. - * If you add your own tests (e.g. for radix sort or to test additional corner - cases), be sure to mention it explicitly. - -These questions should help guide you in performance analysis on future -assignments, as well. - -## Submit - -If you have modified any of the `CMakeLists.txt` files at all (aside from the -list of `SOURCE_FILES`), you must test that your project can build in Moore -100B/C. Beware of any build issues discussed on the Google Group. - -1. Open a GitHub pull request so that we can see that you have finished. - The title should be "Submission: YOUR NAME". -2. Send an email to the TA (gmail: kainino1+cis565@) with: - * **Subject**: in the form of `[CIS565] Project 2: PENNKEY` - * Direct link to your pull request on GitHub - * In the form of a grade (0-100+) with comments, evaluate your own - performance on the project. - * Feedback on the project itself, if any. 
+* Ziwei Zong
+* Tested on: Windows 10, i7-5500 @ 2.40GHz 8GB, GTX 950M (Personal)
+
+Descriptions
+--------------------------
+ * Part 1 : CPU Scan & Compaction (cpu.cu)
+ * Part 2 : Naive Scan (naive.cu)
+ * Part 3 : Work-Efficient Scan & Compaction (efficient.cu)
+ * Part 4 : Thrust Scan (thrust.cu)
+ * Part 5 : Radix Sort (thrust.cu, `RadixSort::sort`)
+
+Block Size Optimization
+--------------------------
+
+|            |  32 |  64 | 128 | 256 | 512 | 1024|
+|------------|-----|-----|-----|-----|-----|-----|
+|naive       |0.062|0.061|0.060|0.062|0.064|0.078|
+|efficient   |0.139|0.139|0.140|0.142|0.148|0.155|
+|thrust      |1.060|1.180|1.200|1.100|1.029|1.090|
+
+(all times in ms)
+
+Based on these numbers, I chose a block size of 128 for the naive scan, 64 for the efficient scan, and 512 for the thrust scan.
+
+CPU & GPU Performance Comparison
+--------------------------------
+
+|Method        |Array length   | 2^4   | 2^6     | 2^8     | 2^10    | 2^12    |
+|--------------|---------------|-------|---------|---------|---------|---------|
+|cpu           |power of 2     |0.0024 |0.002304 |0.002304 |0.002304 |0.002304 |
+|cpu           |non-power of 2 |0.0023 |0.002304 |0.002304 |0.002336 |0.002336 |
+|gpu naive     |power of 2     |0.027  |0.0374   |0.0496   |0.0609   |0.0766   |
+|gpu naive     |non-power of 2 |0.0252 |0.0355   |0.0473   |0.0587   |0.0745   |
+|gpu efficient |power of 2     |0.0541 |0.0765   |0.0985   |0.1312   |0.1551   |
+|gpu efficient |non-power of 2 |0.053  |0.0748   |0.0969   |0.1307   |0.1537   |
+|gpu thrust    |power of 2     |0.03293|0.3129   |0.3414   |0.3854   |0.3011   |
+|gpu thrust    |non-power of 2 |0.3039 |0.3476   |0.2955   |0.3535   |0.3035   |
+
+(all times in ms)
+
+The graphs are shown below.
+
+![](images/power%20of%202.png)
+![](images/not%20power%20of%202.png)
+
+- As the line charts show, the thrust scan time does not follow any clear pattern, while the naive and efficient scan times grow roughly as a power function of the array length.
+- Strangely, the CPU time stays very small no matter how long the array is. I suspected that timing CPU code with a CUDA event was inaccurate, so I also tried std::clock.
However, the reported time was even smaller (0.0). So the CPU really did outperform the GPU on this task, which is probably due to the high cost of GPU global-memory access for such small workloads.
+- The naive scan also runs faster than the work-efficient scan, which is not what I expected.
+
+
+Output
+--------------------------
+
+****************
+** SCAN TESTS **
+****************
+
+    [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 26 0 ]
+==== cpu scan, power-of-two ====
+StreamCompaction::CPU::scan : exclusive prefix sum.
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6203 6229 ]
+==== cpu scan, non-power-of-two ====
+StreamCompaction::CPU::scan : exclusive prefix sum.
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6146 6190 ]
+    passed
+==== naive scan, power-of-two ====
+    GPU time for naive scan : 0.0696ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6203 6229 ]
+    passed
+==== naive scan, non-power-of-two ====
+    GPU time for naive scan : 0.0676ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6146 6190 ]
+    passed
+==== work-efficient scan, power-of-two ====
+    [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 26 0 ]
+    GPU time for efficient scan : 0.1403ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6203 6229 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+    [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 44 8 ]
+    GPU time for efficient scan : 0.1403ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6146 6190 ]
+    passed
+==== thrust scan, power-of-two ====
+    GPU time for thrust scan : 128.8397ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6203 6229 ]
+    passed
+==== thrust scan, non-power-of-two ====
+    GPU time for thrust scan : 1.1305ms
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 6146 6190 ]
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ]
+==== cpu compact without scan, power-of-two ====
+    [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 
2 1 ] + passed + ==== cpu compact without scan, non-power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 3 2 ] + passed + ==== cpu compact with scan ==== + StreamCompaction::CPU::scan : exclusive prefix sum. + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 1 ] + passed + ==== work-efficient compact, power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ] + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 1 ] + passed + ==== work-efficient compact, non-power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ] + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 3 2 ] + passed + +***************************** +** Radix Sort ** +***************************** + ==== Radix Sort, power-of-two ==== + [ 4 7 2 6 3 5 1 0 ] + [ 4 2 6 0 7 3 5 1 ] diff --git a/images/not power of 2.png b/images/not power of 2.png new file mode 100644 index 0000000..6297623 Binary files /dev/null and b/images/not power of 2.png differ diff --git a/images/power of 2.png b/images/power of 2.png new file mode 100644 index 0000000..050a6f4 Binary files /dev/null and b/images/power of 2.png differ diff --git a/src/main.cpp b/src/main.cpp index 7308451..9834fc7 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -12,33 +12,35 @@ #include #include #include "testing_helpers.hpp" +#include +#include int main(int argc, char* argv[]) { - const int SIZE = 1 << 8; + const int SIZE = 1 << 10; const int NPOT = SIZE - 3; int a[SIZE], b[SIZE], c[SIZE]; // Scan tests - + printf("\n"); printf("****************\n"); printf("** SCAN TESTS **\n"); printf("****************\n"); genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; printArray(SIZE, a, true); - zeroArray(SIZE, b); printDesc("cpu scan, power-of-two"); StreamCompaction::CPU::scan(SIZE, b, a); - printArray(SIZE, b, true); + //printArray(SIZE, b, true); zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); StreamCompaction::CPU::scan(NPOT, c, a); - printArray(NPOT, b, true); + //printArray(NPOT, b, true); printCmpResult(NPOT, b, c); - + zeroArray(SIZE, c); 
printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); @@ -48,15 +50,14 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); - //printArray(SIZE, c, true); + //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); - + zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); //printArray(SIZE, c, true); printCmpResult(SIZE, b, c); - zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); @@ -74,7 +75,7 @@ int main(int argc, char* argv[]) { StreamCompaction::Thrust::scan(NPOT, c, a); //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); - + //* printf("\n"); printf("*****************************\n"); printf("** STREAM COMPACTION TESTS **\n"); @@ -84,9 +85,8 @@ int main(int argc, char* argv[]) { genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case printArray(SIZE, a, true); - + a[SIZE - 1] = 0; int count, expectedCount, expectedNPOT; - zeroArray(SIZE, b); printDesc("cpu compact without scan, power-of-two"); count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); @@ -109,13 +109,37 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("work-efficient compact, power-of-two"); + printArray(SIZE, a, true);//delete count = StreamCompaction::Efficient::compact(SIZE, c, a); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); + printArray(SIZE, a, true);//delete count = StreamCompaction::Efficient::compact(NPOT, c, a); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + + printf("\n"); + printf("*****************************\n"); + printf("** Radix Sort **\n"); + printf("*****************************\n"); + + 
a[0] = 4; + a[1] = 7; + a[2] = 2; + a[3] = 6; + a[4] = 3; + a[5] = 5; + a[6] = 1; + a[7] = 0; + + zeroArray(8, c); + printDesc("Radix Sort, power-of-two"); + printArray(8, a, true);//delete + StreamCompaction::RadixSort::sort(8, c, a); + printArray(8, c, true); + //printCmpResult(SIZE, b, c);*/ + std::cin.get(); } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 4f52663..698a919 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -7,6 +7,10 @@ #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +#define blockSize_eff 64 +#define blockSize_naive 128 +#define blockSize_thrust 512 + /** * Check for CUDA errors; print and exit if there was a problem. */ diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index e600c29..cc8e3a6 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,5 +1,8 @@ #include #include "cpu.h" +#include "common.h" +#include +#include namespace StreamCompaction { namespace CPU { @@ -8,8 +11,40 @@ namespace CPU { * CPU scan (prefix sum). 
 */
 void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+    cudaEvent_t start, stop;
+    cudaEventCreate(&start);
+    cudaEventCreate(&stop);
+
+    cudaEventRecord(start);
+    std::clock_t t1, t2;
+    t1 = std::clock();
+
+    double duration;
+
+    for (int i = 0; i < n; i++)
+    {
+        if (i == 0)
+        {
+            odata[i] = 0;
+            continue;
+        }
+        odata[i] = odata[i - 1] + idata[i - 1];
+    }
+    t2 = std::clock();
+
+    duration = (t2 - t1) * 1000.0 / CLOCKS_PER_SEC; // convert clock ticks to ms
+
+    cudaEventRecord(stop);
+
+    cudaEventSynchronize(stop);
+    float milliseconds = 0;
+    cudaEventElapsedTime(&milliseconds, start, stop);
+
+    printf("\t(CudaEvent)CPU time for scan : %.3fms\n", milliseconds);
+    printf("\t(Clock)CPU time for scan : %.3fms\n", duration);
+
+    // TO_DOne
+    //printf("StreamCompaction::CPU::scan : exclusive prefix sum.\n");
 }
 
 /**
@@ -18,8 +53,16 @@ void scan(int n, int *odata, const int *idata) {
  * @returns the number of elements remaining after compaction.
  */
 int compactWithoutScan(int n, int *odata, const int *idata) {
-    // TODO
-    return -1;
+    int k = 0;
+    for (int i = 0; i < n; i++)
+    {
+        if (idata[i] != 0)
+        {
+            odata[k++] = idata[i];
+        }
+    }
+    // TO_DOne
+    return k;
 }
 
 /**
@@ -28,8 +71,36 @@ int compactWithoutScan(int n, int *odata, const int *idata) {
  * @returns the number of elements remaining after compaction.
*/ int compactWithScan(int n, int *odata, const int *idata) { - // TODO - return -1; + // TO_DOne + int*TempArray = new int[n+1]; + int*ScanArray = new int[n+1]; + for (int i = 0; i < n; i++) + { + if (idata[i] != 0) + { + TempArray[i] = 1; + } + else + { + TempArray[i] = 0; + } + } + TempArray[n] = 0; + scan(n+1, ScanArray, TempArray); + + int k = 0; + for (int i = 0; i < n; i++) + { + if (TempArray[i]==1) + { + odata[ScanArray[i]] = idata[i]; + k++; + } + } + int count = ScanArray[n]; + delete[] TempArray; + delete[] ScanArray; + return count; } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index b2f739b..81aa40e 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -4,31 +4,190 @@ #include "efficient.h" namespace StreamCompaction { -namespace Efficient { - -// TODO: __global__ - -/** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); -} - -/** - * Performs stream compaction on idata, storing the result into odata. - * All zeroes are discarded. - * - * @param n The number of elements in idata. - * @param odata The array into which to store elements. - * @param idata The array of elements to compact. - * @returns The number of elements remaining after compaction. - */ -int compact(int n, int *odata, const int *idata) { - // TODO - return -1; -} - -} -} + namespace Efficient { + + // TODO: __global__ + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
+         */
+        int * dev_o;
+
+        __global__ void kernUpSweep(int pow2d, int * dev_odata, int end)
+        {
+            int k = blockIdx.x * blockDim.x + threadIdx.x;
+            if (k > end) return;
+            dev_odata[k * 2 * pow2d + pow2d * 2 - 1] += dev_odata[k * 2 * pow2d + pow2d - 1];
+        }
+
+        __global__ void kernDownSweep(int pow2d, int * dev_odata, int n, int end)
+        {
+            int k = (blockIdx.x * blockDim.x + threadIdx.x);
+            if (k > end) return;
+            k = k * 2 * pow2d;
+            int t = dev_odata[k + pow2d - 1];
+            dev_odata[k + pow2d - 1] = dev_odata[k + pow2d * 2 - 1];
+            dev_odata[k + pow2d * 2 - 1] += t;
+        }
+
+        __global__ void setRootZero(int * dev_odata, int n)
+        {
+            dev_odata[n - 1] = 0;
+        }
+
+        void initArrays(int n, const int *hst_idata)
+        {
+            int size = n*sizeof(int);
+
+            cudaMalloc((void**)&dev_o, size);
+            //checkCUDAError("cudaMalloc dev_o failed");
+            cudaMemcpy(dev_o, hst_idata, size, cudaMemcpyHostToDevice);
+            //checkCUDAError("cudaMemcpy odata->dev_o failed");
+        }
+
+        void freeArrays()
+        {
+            cudaFree(dev_o);
+        }
+
+        void scan(int n, int *odata, const int *idata) {
+            // TO_DOne
+            int N = ilog2ceil(n);
+            N = pow(2, N);
+
+            dim3 threadsPerBlock(blockSize_eff);
+
+            // Pad the device array to N with zeros; copying N elements
+            // straight from idata would read past the end of the input.
+            cudaMalloc((void**)&dev_o, N*sizeof(int));
+            cudaMemset(dev_o, 0, N*sizeof(int));
+            cudaMemcpy(dev_o, idata, n*sizeof(int), cudaMemcpyHostToDevice);
+
+            cudaEvent_t start, stop;
+            cudaEventCreate(&start);
+            cudaEventCreate(&stop);
+
+            cudaEventRecord(start);
+            for (int d = 0; d <= ilog2ceil(N) - 1; d++)
+            {
+                int pow2d = pow(2, d);
+                int end = (N - 1) / (2 * pow2d) + 1;
+                dim3 fullBlocksPerGrid((end + blockSize_eff - 1) / blockSize_eff);
+                kernUpSweep <<<fullBlocksPerGrid, threadsPerBlock>>>(pow2d, dev_o, end);
+            }
+            setRootZero<<<1, 1>>>(dev_o, N);
+            for (int d = ilog2ceil(N) - 1; d >= 0; d--)
+            {
+                int pow2d = pow(2, d);
+                int end = (N - 1) / (2 * pow2d) + 1;
+                dim3 fullBlocksPerGrid((end + blockSize_eff - 1) / blockSize_eff);
+                kernDownSweep <<<fullBlocksPerGrid, threadsPerBlock>>>(pow2d, dev_o, N, end);
+            }
+            cudaEventRecord(stop);
+            cudaEventSynchronize(stop);
+            float milliseconds = 0;
+            cudaEventElapsedTime(&milliseconds, start, stop);
+
+            cudaMemcpy(odata, dev_o, n*sizeof(int), cudaMemcpyDeviceToHost);
+
+            printf("\t GPU time for efficient scan : %.4fms\n", milliseconds);
+
+            freeArrays();
+        }
+
+        /**
+         * Performs stream compaction on idata, storing the result into odata.
+         * All zeroes are discarded.
+         *
+         * @param n The number of elements in idata.
+         * @param odata The array into which to store elements.
+         * @param idata The array of elements to compact.
+         * @returns The number of elements remaining after compaction.
+         */
+        int * dev_temp;
+        int * dev_scan;
+        int * dev_compactOut;
+        int * dev_input;
+
+        void freeCompArrays()
+        {
+            cudaFree(dev_temp);
+            cudaFree(dev_scan);
+            cudaFree(dev_compactOut);
+            cudaFree(dev_input);
+        }
+
+        __global__ void kernMapToBoolean(int * dev_idata, int *dev_outTemp, int n)
+        {
+            int index = threadIdx.x;
+            if (dev_idata[index] != 0 && index < n)
+                dev_outTemp[index] = 1;
+            else
+                dev_outTemp[index] = 0;
+        }
+
+        __global__ void kernScatter(int *dev_temp, int *dev_scan, int *dev_idata, int *dev_odata)
+        {
+            int index = threadIdx.x;
+            if (dev_temp[index] == 1)
+                dev_odata[dev_scan[index]] = dev_idata[index];
+        }
+
+        int compact(int n, int *odata, const int *idata) {
+            int N = pow(2, ilog2ceil(n));
+            int size = N*sizeof(int);
+
+            cudaMalloc((void**)&dev_input, size);
+            cudaMemset(dev_input, 0, size); // zero-pad up to the next power of two
+            cudaMemcpy(dev_input, idata, n*sizeof(int), cudaMemcpyHostToDevice);
+            cudaMalloc((void**)&dev_temp, size);
+            cudaMalloc((void**)&dev_scan, size);
+
+            //Step 1 : Map to boolean
+            kernMapToBoolean << <1, N >> >(dev_input, dev_temp, n);//later:blocksize,gridsize
+            cudaMemcpy(dev_scan, dev_temp, size, cudaMemcpyDeviceToDevice);
+
+            //Step 2 : Run exclusive scan
+            for (int d = 0; d <= ilog2ceil(N) - 1; d++)
+            {
+                int pow2d = pow(2, d);
+                int end = (N - 1) / (2 * pow2d) + 1;
+                kernUpSweep << <1, end >> >(pow2d, dev_scan, end);//later:blocksize,gridsize
+            }
+
+            setRootZero << <1, 1 >> >(dev_scan, N);
+
+            for (int d = ilog2ceil(N) - 1; d >= 0; d--)
+            {
+                int pow2d = pow(2, d);
+                int end = (N - 1) / (2 * pow2d) + 1;
+                kernDownSweep << <1, end >> >(pow2d, dev_scan, N, end);
+            }
+
+            //Step 3 : Scatter
+            int compactLength;
+            cudaMemcpy(&compactLength, &(dev_scan[N - 1]), sizeof(int), cudaMemcpyDeviceToHost);
+            cudaMalloc((void**)&dev_compactOut, compactLength*sizeof(int));
+
+            kernScatter << <1, N >> >(dev_temp, dev_scan, dev_input, dev_compactOut);//later:blocksize,gridsize
+
+            cudaMemcpy(odata, dev_compactOut, compactLength*sizeof(int), cudaMemcpyDeviceToHost);
+            // TO_DOne
+            freeCompArrays();
+            return
compactLength; + } + + } +} \ No newline at end of file diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 3d86b60..3ee5846 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -4,17 +4,69 @@ #include "naive.h" namespace StreamCompaction { -namespace Naive { + namespace Naive { -// TODO: __global__ + // TODO: __global__ -/** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); -} + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + */ + int * dev_o; -} + __global__ void kernNaiveScan(int pow2d_1, int *dev_odata) + { + int k = blockIdx.x * blockDim.x + threadIdx.x; + if (k >= pow2d_1) + dev_odata[k] = dev_odata[k - pow2d_1] + dev_odata[k]; + } + + void initArrays(int n, const int *hst_idata) + { + int size = n*sizeof(int); + + cudaMalloc((void**)&dev_o, size); + //checkCUDAError("cudaMalloc dev_o failed"); + cudaMemcpy(dev_o, hst_idata, size, cudaMemcpyHostToDevice); + //checkCUDAError("cudaMemcpy odata->dev_o failed"); + + } + + void freeArrays() + { + cudaFree(dev_o); + } + + void scan(int n, int *odata, const int *idata) { + dim3 threadsPerBlock(blockSize_naive); + dim3 fullBlocksPerGrid((n + blockSize_naive - 1) / blockSize_naive); + //??? inclusive or exclusive ? 
not exactly 39.2/slides
+            initArrays(n, idata);
+
+            cudaEvent_t start, stop;
+            cudaEventCreate(&start);
+            cudaEventCreate(&stop);
+
+            cudaEventRecord(start);
+            for (int d = 1; d <= ilog2ceil(n); d++)
+            {
+                int pow2_d_1 = pow(2, d - 1);
+                kernNaiveScan <<<fullBlocksPerGrid, threadsPerBlock>>>(pow2_d_1, dev_o);
+            }
+            cudaEventRecord(stop);
+            cudaEventSynchronize(stop);
+            float milliseconds = 0;
+            cudaEventElapsedTime(&milliseconds, start, stop);
+            printf("\t GPU time for naive scan : %.4fms\n", milliseconds);
+            // TO_DOne
+            cudaMemcpy(odata, dev_o, n*sizeof(int), cudaMemcpyDeviceToHost);
+            //inclusive to exclusive
+            for (int i = n - 1; i > 0; i--)
+            {
+                odata[i] = odata[i - 1];
+            }
+            odata[0] = 0;
+            freeArrays();
+        }
+
+    }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index d8dbb32..ba09626 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -7,16 +7,143 @@
 #include "thrust.h"
 namespace StreamCompaction {
-namespace Thrust {
-
-/**
- * Performs prefix-sum (aka scan) on idata, storing the result into odata.
- */
-void scan(int n, int *odata, const int *idata) {
-    // TODO use `thrust::exclusive_scan`
-    // example: for device_vectors dv_in and dv_out:
-    // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
-}
-
-}
-}
+    namespace Thrust {
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+            // TO_DOne use `thrust::exclusive_scan`
+            // example: for device_vectors dv_in and dv_out:
+            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+            //thrust::host_vector<int> hst_in(n);
+            //hst_in.resize(n);
+
+            thrust::device_vector<int> dv_in(idata, idata + n);
+            thrust::device_vector<int> dv_out = dv_in;
+
+            //thrust::copy_n(idata, n, dv_in);
+            cudaEvent_t start, stop;
+            cudaEventCreate(&start);
+            cudaEventCreate(&stop);
+
+            cudaEventRecord(start);
+            thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            cudaEventRecord(stop);
+
+            cudaEventSynchronize(stop);
+            float milliseconds = 0;
+            cudaEventElapsedTime(&milliseconds, start, stop);
+            printf("\t GPU time for thrust scan : %.4fms\n", milliseconds);
+            thrust::copy(dv_out.begin(), dv_out.end(), odata);
+        }
+
+    }
+    namespace RadixSort
+    {
+        __global__ void kernCalc_e(
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dev_iArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dev_eArray,
+            int n
+            )
+        {
+            int index = blockIdx.x*blockDim.x + threadIdx.x;
+            if (index >= n) return;
+            int LSB = dev_iArray[index] % 2;
+            dev_eArray[index] = (LSB == 1) ? 0 : 1;
+        }
+        __global__ void kernCalc_t(
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_fArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_tArray,
+            int totalFalses
+            )
+        {
+            int index = threadIdx.x;
+
+            dv_tArray[index] = index - dv_fArray[index] + totalFalses;
+        }
+        __global__ void kernCalc_d(
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_fArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_tArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_eArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_dArray
+            )
+        {
+            int index = threadIdx.x;
+            bool bi = (dv_eArray[index] == 1) ? false : true;
+            dv_dArray[index] = bi ?
dv_tArray[index] : dv_fArray[index];
+        }
+
+        __global__ void kernScatterBasedOn_d(
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_iArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_oArray,
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> dv_dArray
+            )
+        {
+            int index = threadIdx.x;
+            dv_oArray[dv_dArray[index]] = dv_iArray[index];
+        }
+
+        void sort(int n, int *odata, const int *idata)
+        {
+            //initArrays_RS(n, idata);
+            thrust::device_vector<int> dv_i(idata, idata + n);
+            thrust::device_vector<int> dv_e = dv_i;
+            thrust::device_vector<int> dv_f = dv_e;
+            thrust::device_vector<int> dv_t = dv_f;
+            thrust::device_vector<int> dv_d = dv_f;
+            thrust::device_vector<int> dv_o = dv_d;
+
+            dim3 fullBlocksPerGrid((n + blockSize_thrust - 1) / blockSize_thrust);
+            dim3 threadsPerBlock(blockSize_thrust);
+
+            // Step 1 : Compute e array
+            kernCalc_e <<<fullBlocksPerGrid, threadsPerBlock>>>(dv_i.begin(), dv_e.begin(), n);
+            /*
+            thrust::copy_n(dv_e.begin(), n, odata);
+            printf("\n****** step 1 : dv_e\n----[");
+            for (int i = 0; i < n; i++)
+            {
+                printf(" %3d", odata[i]);
+            }
+            printf("]\n"); //*/
+            // Step 2 : Exclusive Scan e
+            thrust::exclusive_scan(dv_e.begin(), dv_e.end(), dv_f.begin());
+            /*
+            thrust::copy_n(dv_f.begin(), n, odata);
+            printf("\n****** step 2 : dv_f\n----[");
+            for (int i = 0; i < n; i++)
+            {
+                printf(" %3d", odata[i]);
+            }
+            printf("]\n"); //*/
+
+            //Step 3 : Compute totalFalses
+            int totalFalses = dv_e[n - 1] + dv_f[n - 1];
+
+            //Step 4 : Compute t array
+            kernCalc_t << <1, n >> >(dv_f.begin(), dv_t.begin(), totalFalses);
+
+            //Step 5 : Compute d array and scatter based on d
+            kernCalc_d << <1, n >> >(dv_f.begin(), dv_t.begin(), dv_e.begin(), dv_d.begin());
+
+            kernScatterBasedOn_d << <1, n >> >(dv_i.begin(), dv_o.begin(), dv_d.begin());
+
+            thrust::copy_n(dv_o.begin(), n, odata);
+            /*printf("\n****** final : dv_o\n----[");
+            for (int i = 0; i < n; i++)
+            {
+                printf(" %3d", odata[i]);
+            }
+            printf("]\n"); */
+        }
+    }
+}
\ No newline at end of file
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
index 06707f3..372d4e3 100644
--- 
a/stream_compaction/thrust.h +++ b/stream_compaction/thrust.h @@ -1,7 +1,11 @@ #pragma once namespace StreamCompaction { -namespace Thrust { - void scan(int n, int *odata, const int *idata); -} + namespace Thrust { + void scan(int n, int *odata, const int *idata); + } + namespace RadixSort + { + void sort(int n, int *odata, const int *idata); + } }
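For reference, the split step that the `RadixSort` kernels above implement (the `e`/`f`/`t`/`d` arrays from the Parallel Algorithms slides) can be sketched on the host. Note that `sort` above only examines the least-significant bit (`dev_iArray[index] % 2`), so its output `[ 4 2 6 0 7 3 5 1 ]` is a single split pass rather than a fully sorted array; looping the pass over all bits gives a full LSB radix sort. A minimal host-side C++ sketch (function names `splitByBit` and `radixSort` are mine):

```cpp
#include <cassert>
#include <vector>

// One split pass of LSB radix sort: stable-partition by the given bit,
// using the e (bit is 0), f (scan of e), and t/d (destination) arrays.
std::vector<int> splitByBit(const std::vector<int>& in, int bit) {
    int n = (int)in.size();
    std::vector<int> e(n), f(n, 0), out(n);
    for (int i = 0; i < n; ++i)
        e[i] = ((in[i] >> bit) & 1) ? 0 : 1;   // 1 where the bit is 0
    for (int i = 1; i < n; ++i)
        f[i] = f[i - 1] + e[i - 1];            // exclusive scan of e
    int totalFalses = e[n - 1] + f[n - 1];
    for (int i = 0; i < n; ++i) {
        int d = e[i] ? f[i] : i - f[i] + totalFalses;  // destination index
        out[d] = in[i];
    }
    return out;
}

// Full LSB radix sort: apply the split pass bit by bit.
std::vector<int> radixSort(std::vector<int> v, int bits) {
    for (int b = 0; b < bits; ++b)
        v = splitByBit(v, b);
    return v;
}
```

On the example input, one pass on bit 0 reproduces the README output above, and three passes fully sort the 3-bit values.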