Spark 2.0.0 support #108

Open
a-roberts opened this issue Jun 14, 2016 · 13 comments
Comments

@a-roberts

I'm working on this and will submit a pull request once done. We hit NoSuchMethodError problems as soon as we try to run anything but scheduling-throughput.

The fix for that is to modify spark-tests/project/SparkTestsBuild.scala: use 2.0.0-preview for the org.apache.spark dependency version and Scala 2.11.8. Specifically, this resolves

NoSuchMethodError: org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions; at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137) 

which is triggered by

class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
  override def runTest(rdd: RDD[_], reduceTasks: Int) {
    rdd.asInstanceOf[RDD[(String, String)]]
      .map{case (k, v) => (k, v.toInt)}.reduceByKey(_ + _, reduceTasks).count()
  } 
}
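
For reference, the dependency bump in SparkTestsBuild.scala amounts to something like the following (a sketch in plain sbt settings style; the actual build file structures its settings differently, so treat the names here as illustrative):

// Illustrative sbt settings for spark-tests/project/SparkTestsBuild.scala:
// move to Scala 2.11.8 and the Spark 2.0.0-preview artifacts.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0-preview" % "provided"
)
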
@a-roberts

a-roberts commented Jun 14, 2016

With only the above change we get

16/06/14 12:52:44 INFO ContextCleaner: Cleaned shuffle 9
Exception in thread "main" java.lang.NoSuchMethodError: org/json4s/jackson/JsonMethods$.render$default$2(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/Formats;
        at spark.perf.TestRunner$.main(TestRunner.scala:47)
        at spark.perf.TestRunner.main(TestRunner.scala)

By removing the call to render we can now build and run all of SparkPerf with Spark 2.0.0 (there's probably a better fix; I played around with the json4s import versions but without success). The files to change are listed below, followed by a sketch of the TestRunner change:

modified:   lib/sparkperf/testsuites.py
modified:   mllib-tests/project/MLlibTestsBuild.scala
modified:   spark-tests/project/SparkTestsBuild.scala
modified:   streaming-tests/project/StreamingTestsBuild.scala
modified:   spark-tests/src/main/scala/spark/perf/TestRunner.scala
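
For illustration, the TestRunner change is roughly the following (a sketch only; resultsJson is a hypothetical stand-in for the JValue that TestRunner.scala actually builds):

import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Hypothetical JValue standing in for the results object built in TestRunner.scala.
val resultsJson = ("testName" -> "aggregate-by-key") ~ ("time" -> 1.01)

// Before: compact(render(resultsJson)) -- render's implicit Formats parameter is what
// triggers the NoSuchMethodError against the json4s version bundled with Spark 2.0.0.
// After: pass the JValue straight to compact.
println(compact(resultsJson))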

Pull request to follow

@a-roberts

a-roberts commented Jun 20, 2016

All modules* built OK, code changes currently at https://github.com/a-roberts/spark-perf/commit/5f090fc2f1c272b839cee8965c77293d018c18d1

I'll sanity check this first by running all of the tests before contributing. I noticed a few API changes we need to handle, and I've also changed the configuration file to look for $SPARK_HOME instead of /root by default.

Still working on MLlib, actually; in my commit nothing for this module is built (duration 0s!).

@a-roberts

I've updated my commit using the new APIs available in the latest Spark 2 code. I think we should either create a new branch for 2.0 or simply provide different defaults if we detect the user specifies Spark 2 (e.g. Scala 2.11.8 instead of Scala 2.10.x). I've verified that all ML tests now function as expected.

This currently relies on having the jars from a recently built Spark 2 in the lib folder for all spark-perf projects, because the APIs have changed since the spark-2.0.0-preview artifact in Maven Central; the requirement will be removed once spark-2.0.0 artifacts are available.

I'd appreciate having this reviewed; you can easily view my changes at master...a-roberts:master

@a-roberts

We've noticed a 30% geomean regression for Spark 2 with this SparkPerf versus Spark 1.5.2 with "normal" SparkPerf (i.e. before this changeset). This is running with a low scale factor and the configuration below.

Either my changes are a real disaster or we've hit a significant performance regression. We can gather a 1.6.2 comparison, but I'd like my changes to the benchmark itself to be checked first so we can rule out problems there.

@pwendell, as a top contributor to this project, can you or anybody else familiar with the new Spark 2 APIs please review this changeset?

Configuration used where we see the big regression:

  1. spark-perf/config/config.py: SCALE_FACTOR=0.05
     No. of Workers: 1
     Executors per Worker: 1
     Executor Memory: 18G
     Driver Memory: 8G
     Serializer: kryo
  2. $SPARK_HOME/conf/spark-defaults.conf: executor Java options: -Xdisableexplicitgc -Xcompressedrefs

Main changes I made (a sketch of a few of the API-level changes follows the list):

  • Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
  • MLAlgorithmTests use Vectors.fromML
  • For streaming-tests HdfsRecoveryTest we use wordStream.foreachRDD not wordStream.foreach
  • KVDataTest uses awaitTerminationOrTimeout in a SparkStreamingContext instead of awaitTermination
  • Trivial: we use compact not compact.render for outputting json
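
To illustrate a few of these, the API-level changes look roughly like the following (a sketch only; identifiers such as wordStream, outputDir, and ssc are placeholders rather than the exact names used in the test sources):

import org.apache.spark.{ml, mllib}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Sketch: Spark 2 drops DStream.foreach, so HdfsRecoveryTest-style code moves to foreachRDD.
def persistCounts(wordStream: DStream[(String, Int)], outputDir: String): Unit = {
  wordStream.foreachRDD { rdd =>
    rdd.saveAsTextFile(outputDir + "/counts-" + System.currentTimeMillis())
  }
}

// Sketch: wait with a timeout rather than blocking indefinitely.
def runFor(ssc: StreamingContext, millis: Long): Unit = {
  ssc.start()
  ssc.awaitTerminationOrTimeout(millis) // replaces awaitTermination
  ssc.stop(stopSparkContext = false)
}

// Sketch: convert the new ml.linalg vectors back to mllib.linalg vectors for the MLlib tests.
def toOldVector(v: ml.linalg.Vector): mllib.linalg.Vector =
  mllib.linalg.Vectors.fromML(v)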

In Spark 2.0 the top five methods where we spend our time are as follows; the percentage is how much of the overall processing time was spent in that particular method:

  1. AppendOnlyMap.changeValue 44%
  2. SortShuffleWriter.write 19%
  3. SizeTracker.estimateSize 7.5%
  4. SizeEstimator.estimate 5.36%
  5. Range.foreach 3.6%

and in 1.5.2 the top five methods are:

  1. AppendOnlyMap.changeValue 38%
  2. ExternalSorter.insertAll 33%
  3. Range.foreach 4%
  4. SizeEstimator.estimate 2%
  5. SizeEstimator.visitSingleObject 2%

I see the following scores (test name, then the 1.5.2 time, then the 2.0.0 time):

| Test | Spark 1.5.2 | Spark 2.0.0 |
| --- | --- | --- |
| sthroughput | 5.2s | 7.08s |
| agg by key | 0.72s | 1.01s |
| agg by key int | 0.93s | 1.19s |
| agg by key naive | 1.88s | 2.02s |
| sort by key | 0.64s | 0.8s |
| sort by key int | 0.59s | 0.64s |
| scala count | 0.09s | 0.08s |
| scala count w fltr | 0.31s | 0.47s |

This is only running the Spark core tests (scheduling throughput through scala-count-w-filtr, including all in between).

I'll mention this on the mailing list as part of a general performance regression thread so this particular item remains focused on the Spark 2.0.0 changes I have made for SparkPerf; the goal is to have something stable to compare Spark releases with.

@a-roberts

I'm updating this to work with Spark 2 now that it's available, so we don't need to use a snapshot or build against an included version.

@somideshmukh

So now we need to clone and build the new spark-perf to work with Spark 2.0? And which modules of spark-perf will work with Spark 2.0?

@a-roberts

All modules, my PR is at #115

@somideshmukh

But when I go to https://github.com/databricks/spark-perf.git and clone master, I don't find any commit for 2.0.

@a-roberts

a-roberts commented Aug 26, 2016

That's because my change is a pull request that hasn't been merged. I'm now working on a small issue regarding the Spark version in the mllib project, as I see the Travis CI integration build failed. It would be much appreciated if you could clone my changes and see if you find any problems.

@somideshmukh

Hi, I have cloned your changes, integrated them with Spark 2.0, and run the Spark tests. I got proper results with no errors. The only change I needed to make was in the config.py file: in place of MLLIB_SPARK_VERSION = 2.0.0 I needed to keep MLLIB_SPARK_VERSION = 2.0.

@Minkyolyy

Where can I clone your changes?

@saksgarg

Any update on the issues in this project?

@Minkyolyy

Maybe spark-perf 2.0 just replaces some of the packages and doesn't take advantage of the Dataset API.
