Add steps to install multi-threaded OpenBLAS on Ubuntu for the gh-pages branch #81
Conversation
@@ -45,6 +46,12 @@ You will also need other packages, most of which can be installed via apt-get using:

    sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libboost-all-dev

+   With 8 or more CPU threads, Caffe runs as fast as on a decent GPU. If you would like to enjoy the speed-up brought by OpenBLAS:
Please add this performance note to the end of the "Prerequisites" section because it is optional. I fear that "Caffe runs as fast as on a decent GPU" is imprecise and doesn't hold for contemporary scale models, so perhaps replace this line with "To enjoy a parallelization speed-up in CPU mode, install the multi-threaded OpenBLAS"
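For reference, a minimal sketch of exercising multi-threaded OpenBLAS from C++ (an illustration, assuming OpenBLAS is installed and the program is linked with `-lopenblas`; `openblas_set_num_threads` comes from OpenBLAS's `cblas.h`, and the thread count can also be set via the `OPENBLAS_NUM_THREADS` environment variable):

```cpp
#include <cblas.h>   // OpenBLAS CBLAS interface; declares openblas_set_num_threads
#include <cstdio>
#include <vector>

int main() {
  const int n = 1024;
  std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

  // Use the 8 CPU threads mentioned in the note; adjust to the machine.
  openblas_set_num_threads(8);

  // C = 1.0 * A * B + 0.0 * C, row-major, no transposes.
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);

  std::printf("C[0] = %f\n", C[0]);  // Expect 2048.0 for these inputs.
  return 0;
}
```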
With the nice performance benchmarks @kloudkl provided, it seems that... Reasons to get rid of Eigen are: (1) My original impression is that Eigen... Yangqing
Maybe we should learn something from #68. If MKL can be redistributed freely, why bother porting anyway? After cherry-picking from the boost-eigen branch the bug-fixes that also apply to the master branch, the Boost-based random number generator, and the documentation on using plug-and-play open source replacements for MKL, the other parts of it can be safely archived.
A few things need to be checked: (1) Note the MKL academic license does not come with redistribution rights. Other than that, yes, I agree we can simplify the current paths. Yangqing
I agree that porting away from MKL is still important for openness and portability. (1) The Intel compiler and MKL documentation state that the academic license (not the student license, but the academic license) gives redistribution rights [1]. Although we have this option, it does not suit the academic and open spirit of the project. (2) Closed-source / binary-only distribution hinders development. With MKL, only those with licenses can develop and contribute. Further, there are hassles in preparing and hosting binary releases. (3) Agreed. Let's finish the port in this branch, then rebase for any final grooming before the merge. Although I enjoyed @kloudkl's cinematic proposal for archival, after cherry-picking there would be little left to archive, so I'd rather not keep the vestigial branch around. [1] http://software.intel.com/sites/default/files/m/d/4/1/d/8/11x_Redistribution_FAQ_Linux.pdf (see the overview and MKL sections)
In summary of the previous discussions, you both agree that redistribution of MKL is not an option. The only housekeeping task seems to be deleting the already commented-out MKL code and cleaning the Makefile. To get rid of Eigen, optimization steps using SIMD intrinsics are detailed below. Can these be accomplished in a simpler way? (A plainer alternative is sketched after the quoted examples.)
#include <stdlib.h>

// Align on 16 bytes to use __m128d
// (see http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/index.htm#intref_cls/common/intref_sse2_overview.htm).
const size_t CPU_MEM_ALIGNMENT = 16;

inline void CaffeMallocHost(void** ptr, size_t size) {
  // *ptr = malloc(size);
  if (posix_memalign(ptr, CPU_MEM_ALIGNMENT, size)) {
    // Non-zero return value means the aligned allocation failed.
  }
}

inline void CaffeFreeHost(void* ptr) {
  free(ptr);  // memory from posix_memalign is released with plain free()
}
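A small usage sketch of the helpers above (illustrative only; the buffer name and size are arbitrary):

```cpp
#include <cstring>

int main() {
  void* buf = NULL;
  CaffeMallocHost(&buf, 1024 * sizeof(double));  // 16-byte aligned block
  std::memset(buf, 0, 1024 * sizeof(double));    // usable like any heap block
  CaffeFreeHost(buf);                            // released with plain free()
  return 0;
}
```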
SSE sum of vectors - how to improve cache performance:

#include <x86intrin.h>
inline void addToDoubleVectorSSE(const double * what, const double * toWhat, volatile double * dest, const unsigned int len)
{
__m128d * _what = (__m128d*)what;
__m128d * _toWhat = (__m128d*)toWhat;
__m128d * _toWhatBase = (__m128d*)toWhat;
__m128d _dest1;
__m128d _dest2;
#ifdef FAST_SSE
for ( register unsigned int i = 0; i < len; i+= 4, _what += 2, _toWhat += 2, _toWhatBase+=2 )
{
_toWhatBase = _toWhat;
_dest1 = _mm_add_pd( *_what, *_toWhat ); //line A
_dest2 = _mm_add_pd( *(_what+1), *(_toWhat+1)); //line B
*_toWhatBase = _dest1;
*(_toWhatBase+1) = _dest2;
}
#else
for ( register unsigned int i = 0; i < len; i+= 4 )
{
_toWhatBase = _toWhat;
_dest1 = _mm_add_pd( *_what++, *_toWhat++ );
_dest2 = _mm_add_pd( *_what++, *_toWhat++ );
*_toWhatBase++ = _dest1;
*_toWhatBase++ = _dest2;
}
#endif
}

Maratyszcza / CSE6230 / Lecture-6 / example1 / compute.cpp:

void vector_add_sse2_aligned(const double *CSE6230_RESTRICT xPointer, const double *CSE6230_RESTRICT yPointer, double *CSE6230_RESTRICT sumPointer, size_t length) {
// Process arrays by two elements at an iteration
for (; length >= 2; length -= 2) {
const __m128d x = _mm_load_pd(xPointer); // Aligned (!) load two x elements
const __m128d y = _mm_load_pd(yPointer); // Aligned (!) load two y elements
const __m128d sum = _mm_add_pd(x, y); // Compute two sum elements
_mm_store_pd(sumPointer, sum); // Aligned (!) store two sum elements
// Advance pointers to the next two elements
xPointer += 2;
yPointer += 2;
sumPointer += 2;
}
// Process remaining elements (if any)
for (; length != 0; length -= 1) {
const double x = *xPointer; // Load x
const double y = *yPointer; // Load y
const double sum = x + y; // Compute sum
*sumPointer = sum; // Store sum
// Advance pointers to the next elements
xPointer += 1;
yPointer += 1;
sumPointer += 1;
}
}
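For comparison, a minimal sketch of the plainer alternative asked about above: a scalar loop that GCC/Clang typically auto-vectorize at `-O3` (or `-O2 -ftree-vectorize`) without hand-written intrinsics. This is an illustration, not code from either branch; `__restrict__` is a GCC/Clang extension.

```cpp
#include <cstddef>

// Plain element-wise add; the compiler can emit the same SSE/AVX adds
// automatically when built with e.g. -O3 -march=native.
void vector_add_plain(const double* __restrict__ x,
                      const double* __restrict__ y,
                      double* __restrict__ sum,
                      std::size_t length) {
  for (std::size_t i = 0; i < length; ++i) {
    sum[i] = x[i] + y[i];
  }
}
```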
Element-wise operations may not need much attention in terms of optimization.
With a for loop, since we will get rid of Eigen.
I have tried this line of attack with OpenMP (roughly as sketched below). It seems that thread parallelism is far from being able to compete with data parallelism; it took forever for the unit tests to finish. At the end of the day, the only choice is to stick to the SIMD-backed Eigen. Otherwise, we have to either implement an in-house version using the same set of intrinsics or find another math library, which contradicts the motivation of saying goodbye to Eigen. After all, Eigen is a not-so-heavy header-only library which is not as perfect as expected but is still very good at some things. I have also given a shot at the two ideas mentioned above.
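A minimal sketch of the kind of OpenMP element-wise loop being described (an assumed shape, not the actual code that was benchmarked); built with `-fopenmp`:

```cpp
#include <cstddef>
#include <omp.h>

// Thread-parallel element-wise add; the per-element work is so small that
// thread start-up and scheduling overhead dominates, which matches the
// observation that this cannot compete with SIMD data parallelism.
void caffe_add_omp(const double* a, const double* b, double* y, std::size_t n) {
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) {
    y[i] = a[i] + b[i];
  }
}
```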
Question is, how much overall speed gain do we get by optimizing, e.g., ... See http://c2.com/cgi/wiki?PrematureOptimization. Yangqing
An ordinary for loop, even with OpenMP, is still too slow to tolerate for exhaustive gradient checking:

checker.CheckGradientExhaustive(layer, this->blob_bottom_vec_, this->blob_top_vec_);

The vectorized arithmetic Eigen API is almost equivalent to MKL's v?Func routines, except for being a little more verbose (roughly as in the sketch below).
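For illustration, a hedged sketch of the Eigen counterpart of an MKL v?Add-style call on raw buffers, using Eigen's Map over plain pointers (the function name here is hypothetical):

```cpp
#include <Eigen/Core>

// Element-wise y = a + b on raw buffers, the Eigen counterpart of MKL's vdAdd.
inline void eigen_vdAdd(const int n, const double* a, const double* b, double* y) {
  Eigen::Map<const Eigen::ArrayXd> A(a, n);
  Eigen::Map<const Eigen::ArrayXd> B(b, n);
  Eigen::Map<Eigen::ArrayXd> Y(y, n);
  Y = A + B;  // vectorized by Eigen's expression templates
}
```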
I 100% agree with the idea of moving away from MKL, especially given the latest benchmarks. I agree that having fewer dependencies is better, so if it does not hurt, getting rid of Eigen would be good. If we want to remove Eigen only to then reimplement Eigen, that seems like nonsense, especially since it is indeed a lightweight header-only library. Being open source, it could even be included in the Caffe distribution if we really worry about "ease of access". Using Eigen for most vector and matrix operations (other than the critical ones, where it seems that raw OpenBLAS pays off) gives the benefit of being fairly confident that it is "fast enough", so we do not have to worry (blindly) about "should we try to further optimize this or that?". Using Eigen is also an easy way to maintain cross-platform support (#15). I am not an Eigen advocate; I just worry about re-implementing what other open source projects have already solved (a classical example is #54).
I won't go as far as what kloudkl proposed. All VSL functionality will be... See pull request #97 for an example.
I see. That macro code is not super pretty, but it sure does the job. On the other hand, if we have #ifdef USE_MKL then we could just as well #ifdef USE_EIGEN and then measure performance. Also, in #97 I notice comments on a data alignment issue. This is rather easy to fix, and should provide meaningful speed improvements. To be looked into (but first I want to learn how to benchmark the code, so as to only make meaningful changes).
Agreed. The purpose of #97, I think, is to separate front ends and back ends, allowing further speed benchmarks and backend changes to be easily plugged in and out (sketched below).
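A rough sketch of the #ifdef-based backend switching discussed above, with a single front-end function whose implementation is selected at compile time (the macro and function names are illustrative, not the actual #97 code):

```cpp
#include <cstddef>

#if defined(USE_MKL)
  #include <mkl.h>
#elif defined(USE_EIGEN)
  #include <Eigen/Core>
#endif

// Front end: element-wise y = a + b; the back end is chosen at build time.
inline void caffe_cpu_add(const int n, const double* a, const double* b, double* y) {
#if defined(USE_MKL)
  vdAdd(n, a, b, y);                            // MKL VML back end
#elif defined(USE_EIGEN)
  Eigen::Map<Eigen::ArrayXd>(y, n) =
      Eigen::Map<const Eigen::ArrayXd>(a, n) +
      Eigen::Map<const Eigen::ArrayXd>(b, n);   // Eigen back end
#else
  for (int i = 0; i < n; ++i) { y[i] = a[i] + b[i]; }  // plain fallback
#endif
}
```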
To benchmark you can use #83, or just train the network for 100 iterations and get the overall timing.
With the help of #83, it was found that the performance bottleneck is most likely the convolutional layers. The conv layer mainly relies on gemm, gemv, and im2col. The biggest gain can be expected if we focus on this layer and perform finer-grained diagnostics. According to @Yangqing's explanation in #102, gemv is slower than gemm in batch SGD. To determine which lines are actually the performance hotspots, focused profiling with the NVIDIA profiler or gperftools (see the sketch below) is the most effective and conclusive method. GPU mode has higher priority. Take-home message: do not profile or optimize the non-critical parts.
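For reference, a minimal sketch of the kind of focused CPU profiling with gperftools mentioned above (assuming libprofiler is installed; link with `-lprofiler` and inspect the output with `pprof`). The forward/backward calls inside the loop are placeholders, not actual Caffe API:

```cpp
#include <gperftools/profiler.h>

void run_profiled_iterations() {
  ProfilerStart("caffe_conv.prof");  // start writing samples to this file
  for (int i = 0; i < 100; ++i) {
    // net.Forward();   // placeholder: the forward pass under investigation
    // net.Backward();  // placeholder: the backward pass under investigation
  }
  ProfilerStop();                    // flush and close the profile
}
```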
To clarify, I am not saying gemv is slower than gemm per se - what I mean... Yangqing
To help get a better sense of the bottleneck, here is the output of profiling 10 repetitions of batches of size 50 doing forward and backward computations:
Thanks! @sguada's profiling provides very valuable help in understanding the underlying performance characteristics. This issue has probably been superseded by #97. I am not sure whether we still need to recommend OpenBLAS in the installation documentation. The discussion about diagnosing the performance bottlenecks and optimizing them may continue in #102 and beyond.
In fact, this PR contains exactly the same changes as #80, but the INSTALL.md in the root directory of the master branch cannot be edited simultaneously with the one in the gh-pages branch.
This fixes issue: #79