Add steps to install multi-threaded OpenBLAS on Ubuntu for the gh-pages branch #81
Conversation
@@ -45,6 +46,12 @@ You will also need other packages, most of which can be installed via apt-get using:

    sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libboost-all-dev

+   With 8 or more CPU threads, Caffe runs as fast as on a decent GPU. If you would like to enjoy the speed-up brought by OpenBLAS:
Please add this performance note to the end of the "Prerequisites" section because it is optional. I fear that "Caffe runs as fast as on a decent GPU" is imprecise and doesn't hold for contemporary scale models, so perhaps replace this line with "To enjoy a parallelization speed-up in CPU mode, install the multi-threaded OpenBLAS"
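For reference, a minimal sketch of exercising multi-threaded OpenBLAS from C++ (an illustration, assuming OpenBLAS is installed and the program is linked with `-lopenblas`; `openblas_set_num_threads` comes from OpenBLAS's `cblas.h`, and the thread count can also be set via the `OPENBLAS_NUM_THREADS` environment variable):

```cpp
#include <cblas.h>   // OpenBLAS CBLAS interface; declares openblas_set_num_threads
#include <cstdio>
#include <vector>

int main() {
  const int n = 1024;
  std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

  // Use the 8 CPU threads mentioned in the note; adjust to the machine.
  openblas_set_num_threads(8);

  // C = 1.0 * A * B + 0.0 * C, row-major, no transposes.
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);

  std::printf("C[0] = %f\n", C[0]);  // Expect 2048.0 for these inputs.
  return 0;
}
```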
With the nice performance benchmarks @kloudkl provided, it seems that... Reasons to get rid of Eigen are: (1) My original impression is that Eigen... Yangqing
Maybe we should learn something from #68. If MKL can be redistributed freely, why bother porting anyway? After cherry-picking from the boost-eigen branch the bug-fixes that also apply to the master branch, the Boost-based random number generator, and the documentation on using plug-and-play open source replacements for MKL, the other parts of it can be safely archived.
A few things need to be checked: (1) Note the MKL academic license does not come with redistribution rights. Other than that, yes, I agree we can simplify the current paths. Yangqing
I agree that porting away from MKL is still important for openness and portability. (1) The Intel compiler and MKL documentation state that the academic license (not the student license, but the academic license) gives redistribution rights [1]. Although we have this option, it does not suit the academic and open spirit of the project. (2) Closed-source / binary-only distribution hinders development. With MKL, only those with licenses can develop and contribute. Further, there are hassles in preparing and hosting binary releases. (3) Agreed. Let's finish the port in this branch, then rebase for any final grooming before the merge. Although I enjoyed @kloudkl's cinematic proposal for archival, after cherry-picking there would be little left to archive, so I'd rather not keep the vestigial branch around. [1] http://software.intel.com/sites/default/files/m/d/4/1/d/8/11x_Redistribution_FAQ_Linux.pdf (see the overview and MKL sections)
In summary of the previous discussions, you both agree that redistribution of MKL is not an option. The only housekeeping task seems to be deleting the already commented-out MKL code and cleaning the Makefile. To get rid of Eigen, optimization steps using SIMD intrinsics are detailed below. Can these be accomplished in a simpler way? (A plainer alternative is sketched after the quoted examples.)
#include <stdlib.h>

// Align on 16 bytes to use __m128d
// (see http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/index.htm#intref_cls/common/intref_sse2_overview.htm).
const size_t CPU_MEM_ALIGNMENT = 16;

inline void CaffeMallocHost(void** ptr, size_t size) {
  // *ptr = malloc(size);
  if (posix_memalign(ptr, CPU_MEM_ALIGNMENT, size)) {
    // Non-zero return value means the aligned allocation failed.
  }
}

inline void CaffeFreeHost(void* ptr) {
  free(ptr);  // memory from posix_memalign is released with plain free()
}
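A small usage sketch of the helpers above (illustrative only; the buffer name and size are arbitrary):

```cpp
#include <cstring>

int main() {
  void* buf = NULL;
  CaffeMallocHost(&buf, 1024 * sizeof(double));  // 16-byte aligned block
  std::memset(buf, 0, 1024 * sizeof(double));    // usable like any heap block
  CaffeFreeHost(buf);                            // released with plain free()
  return 0;
}
```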
SSE sum of vectors - how to improve cache performance:

#include <x86intrin.h>
inline void addToDoubleVectorSSE(const double * what, const double * toWhat, volatile double * dest, const unsigned int len)
{
__m128d * _what = (__m128d*)what;
__m128d * _toWhat = (__m128d*)toWhat;
__m128d * _toWhatBase = (__m128d*)toWhat;
__m128d _dest1;
__m128d _dest2;
#ifdef FAST_SSE
for ( register unsigned int i = 0; i < len; i+= 4, _what += 2, _toWhat += 2, _toWhatBase+=2 )
{
_toWhatBase = _toWhat;
_dest1 = _mm_add_pd( *_what, *_toWhat ); //line A
_dest2 = _mm_add_pd( *(_what+1), *(_toWhat+1)); //line B
*_toWhatBase = _dest1;
*(_toWhatBase+1) = _dest2;
}
#else
for ( register unsigned int i = 0; i < len; i+= 4 )
{
_toWhatBase = _toWhat;
_dest1 = _mm_add_pd( *_what++, *_toWhat++ );
_dest2 = _mm_add_pd( *_what++, *_toWhat++ );
*_toWhatBase++ = _dest1;
*_toWhatBase++ = _dest2;
}
#endif
}

Maratyszcza / CSE6230 / Lecture-6 / example1 / compute.cpp:

void vector_add_sse2_aligned(const double *CSE6230_RESTRICT xPointer, const double *CSE6230_RESTRICT yPointer, double *CSE6230_RESTRICT sumPointer, size_t length) {
// Process arrays by two elements at an iteration
for (; length >= 2; length -= 2) {
const __m128d x = _mm_load_pd(xPointer); // Aligned (!) load two x elements
const __m128d y = _mm_load_pd(yPointer); // Aligned (!) load two y elements
const __m128d sum = _mm_add_pd(x, y); // Compute two sum elements
_mm_store_pd(sumPointer, sum); // Aligned (!) store two sum elements
// Advance pointers to the next two elements
xPointer += 2;
yPointer += 2;
sumPointer += 2;
}
// Process remaining elements (if any)
for (; length != 0; length -= 1) {
const double x = *xPointer; // Load x
const double y = *yPointer; // Load y
const double sum = x + y; // Compute sum
*sumPointer = sum; // Store sum
// Advance pointers to the next elements
xPointer += 1;
yPointer += 1;
sumPointer += 1;
}
}
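For comparison, a minimal sketch of the plainer alternative asked about above: a scalar loop that GCC/Clang typically auto-vectorize at `-O3` (or `-O2 -ftree-vectorize`) without hand-written intrinsics. This is an illustration, not code from either branch; `__restrict__` is a GCC/Clang extension.

```cpp
#include <cstddef>

// Plain element-wise add; the compiler can emit the same SSE/AVX adds
// automatically when built with e.g. -O3 -march=native.
void vector_add_plain(const double* __restrict__ x,
                      const double* __restrict__ y,
                      double* __restrict__ sum,
                      std::size_t length) {
  for (std::size_t i = 0; i < length; ++i) {
    sum[i] = x[i] + y[i];
  }
}
```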
Element-wise operations may not need much attention in terms of optimization.
With a for loop, since we will get rid of Eigen.
I have tried this line of attack with OpenMP (roughly as sketched below). It seems that thread parallelism is far from being able to compete with data parallelism; it took forever for the unit tests to finish. At the end of the day, the only choice is to stick to the SIMD-backed Eigen. Otherwise, we have to either implement an in-house version using the same set of intrinsics or find another math library, which contradicts the motivation of saying goodbye to Eigen. After all, Eigen is a not-so-heavy header-only library which is not as perfect as expected but is still very good at some things. I have also given a shot at the two ideas mentioned above.
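A minimal sketch of the kind of OpenMP element-wise loop being described (an assumed shape, not the actual code that was benchmarked); built with `-fopenmp`:

```cpp
#include <cstddef>
#include <omp.h>

// Thread-parallel element-wise add; the per-element work is so small that
// thread start-up and scheduling overhead dominates, which matches the
// observation that this cannot compete with SIMD data parallelism.
void caffe_add_omp(const double* a, const double* b, double* y, std::size_t n) {
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) {
    y[i] = a[i] + b[i];
  }
}
```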
Question is, how much overall speed gain do we get by optimizing, e.g., ... See http://c2.com/cgi/wiki?PrematureOptimization. Yangqing
An ordinary for loop, even with OpenMP, is still too slow to tolerate for exhaustive gradient checking:

checker.CheckGradientExhaustive(layer, this->blob_bottom_vec_, this->blob_top_vec_);

The vectorized arithmetic Eigen API is almost equivalent to MKL's v?Func routines, except for being a little more verbose (roughly as in the sketch below).
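For illustration, a hedged sketch of the Eigen counterpart of an MKL v?Add-style call on raw buffers, using Eigen's Map over plain pointers (the function name here is hypothetical):

```cpp
#include <Eigen/Core>

// Element-wise y = a + b on raw buffers, the Eigen counterpart of MKL's vdAdd.
inline void eigen_vdAdd(const int n, const double* a, const double* b, double* y) {
  Eigen::Map<const Eigen::ArrayXd> A(a, n);
  Eigen::Map<const Eigen::ArrayXd> B(b, n);
  Eigen::Map<Eigen::ArrayXd> Y(y, n);
  Y = A + B;  // vectorized by Eigen's expression templates
}
```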
I 100% agree with the idea of moving away from MKL, especially given the latest benchmarks. I agree that having fewer dependencies is better, so if it does not hurt, getting rid of Eigen would be good. If we want to remove Eigen only to then reimplement Eigen, that seems like nonsense, especially since it is indeed a lightweight header-only library. Being open source, it could even be included in the Caffe distribution if we really worry about "ease of access". Using Eigen for most vector and matrix operations (other than the critical ones, where it seems that raw OpenBLAS pays off) gives the benefit of being fairly confident that it is "fast enough", so we do not have to worry (blindly) about "should we try to further optimize this or that?". Using Eigen is also an easy way to maintain cross-platform support (#15). I am not an Eigen advocate; I just worry about re-implementing what other open source projects have already solved (a classical example is #54).
I won't go as far as what kloudkl proposed. All VSL functionality will be... See pull request #97 for an example.
I see. That macro code is not super pretty, but it sure does the job. On the other hand, if we have #ifdef USE_MKL then we could just as well #ifdef USE_EIGEN and then measure performance. Also, in #97 I notice comments on a data alignment issue. This is rather easy to fix, and should provide meaningful speed improvements. To be looked into (but first I want to learn how to benchmark the code, so as to only make meaningful changes).
Agreed. The purpose of #97, I think, is to separate front ends and back ends, allowing further speed benchmarks and backend changes to be easily plugged in and out (sketched below).
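A rough sketch of the #ifdef-based backend switching discussed above, with a single front-end function whose implementation is selected at compile time (the macro and function names are illustrative, not the actual #97 code):

```cpp
#include <cstddef>

#if defined(USE_MKL)
  #include <mkl.h>
#elif defined(USE_EIGEN)
  #include <Eigen/Core>
#endif

// Front end: element-wise y = a + b; the back end is chosen at build time.
inline void caffe_cpu_add(const int n, const double* a, const double* b, double* y) {
#if defined(USE_MKL)
  vdAdd(n, a, b, y);                            // MKL VML back end
#elif defined(USE_EIGEN)
  Eigen::Map<Eigen::ArrayXd>(y, n) =
      Eigen::Map<const Eigen::ArrayXd>(a, n) +
      Eigen::Map<const Eigen::ArrayXd>(b, n);   // Eigen back end
#else
  for (int i = 0; i < n; ++i) { y[i] = a[i] + b[i]; }  // plain fallback
#endif
}
```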
To benchmark you can use #83, or just train the network for 100 iterations and get the overall timing.
With the help of #83, it was found that the performance bottleneck is most likely the convolutional layers. The conv layer mainly relies on gemm, gemv, and im2col. The biggest gain can be expected if we focus on this layer and perform finer-grained diagnostics. According to @Yangqing's explanation in #102, gemv is slower than gemm in batch SGD. To determine which lines are actually the performance hotspots, focused profiling with the NVIDIA profiler or gperftools (see the sketch below) is the most effective and conclusive method. GPU mode has higher priority. Take-home message: do not profile or optimize the non-critical parts.
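For reference, a minimal sketch of the kind of focused CPU profiling with gperftools mentioned above (assuming libprofiler is installed; link with `-lprofiler` and inspect the output with `pprof`). The forward/backward calls inside the loop are placeholders, not actual Caffe API:

```cpp
#include <gperftools/profiler.h>

void run_profiled_iterations() {
  ProfilerStart("caffe_conv.prof");  // start writing samples to this file
  for (int i = 0; i < 100; ++i) {
    // net.Forward();   // placeholder: the forward pass under investigation
    // net.Backward();  // placeholder: the backward pass under investigation
  }
  ProfilerStop();                    // flush and close the profile
}
```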
To clarify, I am not saying gemv is slower than gemm per se - what I mean... Yangqing
To help get a better sense of the bottleneck, here is the output of profiling 10 repetitions of batches of size 50 doing forward and backward computations:
Thanks! @sguada's profiling provides very valuable help in understanding the underlying performance characteristics. This issue has probably been superseded by #97. I am not sure whether we still need to recommend OpenBLAS in the installation documentation. The discussion about diagnosing the performance bottlenecks and optimizing them may continue in #102 and beyond.
In fact, this PR contains exactly the same changes as #80, but the INSTALL.md in the root directory of the master branch cannot be edited simultaneously with the one in the gh-pages branch.
This fixes issue: #79