Spark 2.0.0 support #108

Open
a-roberts opened this issue Jun 14, 2016 · 13 comments
Comments

@a-roberts

I'm working on this and will submit a pull request once done. We hit NoSuchMethodError problems as soon as we try to run anything but scheduling-throughput.

The fix for that is to modify spark-tests/project/SparkTestsBuild.scala: use 2.0.0-preview for the org.apache.spark dependency version and Scala 2.11.8. Specifically, this resolves

NoSuchMethodError: org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions; at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137) 

which is triggered by

class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
  override def runTest(rdd: RDD[_], reduceTasks: Int) {
    rdd.asInstanceOf[RDD[(String, String)]]
      .map{case (k, v) => (k, v.toInt)}.reduceByKey(_ + _, reduceTasks).count()
  } 
}
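
For reference, the dependency bump in SparkTestsBuild.scala amounts to something like the following (a sketch in plain sbt settings style; the actual build file structures its settings differently, so treat the names here as illustrative):

// Illustrative sbt settings for spark-tests/project/SparkTestsBuild.scala:
// move to Scala 2.11.8 and the Spark 2.0.0-preview artifacts.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0-preview" % "provided"
)
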
@a-roberts

a-roberts commented Jun 14, 2016

With only the above change we get

16/06/14 12:52:44 INFO ContextCleaner: Cleaned shuffle 9
Exception in thread "main" java.lang.NoSuchMethodError: org/json4s/jackson/JsonMethods$.render$default$2(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/Formats;
        at spark.perf.TestRunner$.main(TestRunner.scala:47)
        at spark.perf.TestRunner.main(TestRunner.scala)

By removing the call to render we can now build and run all of SparkPerf with Spark 2.0.0 (there's probably a better fix; I played around with the json4s import versions but without success). The files to change are listed below, followed by a sketch of the TestRunner change:

modified:   lib/sparkperf/testsuites.py
modified:   mllib-tests/project/MLlibTestsBuild.scala
modified:   spark-tests/project/SparkTestsBuild.scala
modified:   streaming-tests/project/StreamingTestsBuild.scala
modified:   spark-tests/src/main/scala/spark/perf/TestRunner.scala
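
For illustration, the TestRunner change is roughly the following (a sketch only; resultsJson is a hypothetical stand-in for the JValue that TestRunner.scala actually builds):

import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Hypothetical JValue standing in for the results object built in TestRunner.scala.
val resultsJson = ("testName" -> "aggregate-by-key") ~ ("time" -> 1.01)

// Before: compact(render(resultsJson)) -- render's implicit Formats parameter is what
// triggers the NoSuchMethodError against the json4s version bundled with Spark 2.0.0.
// After: pass the JValue straight to compact.
println(compact(resultsJson))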

Pull request to follow

@a-roberts

a-roberts commented Jun 20, 2016

All modules* built OK, code changes currently at https://github.com/a-roberts/spark-perf/commit/5f090fc2f1c272b839cee8965c77293d018c18d1

I'll sanity check this first by running all of the tests before contributing. I noticed a few API changes we need to handle, and I've also changed the configuration file to look for $SPARK_HOME instead of /root by default.

Still working on MLlib, actually; in my commit nothing for this module is built (duration 0s!).

@a-roberts

I've updated my commit using the new APIs available in the latest Spark 2 code. I think we should either create a new branch for 2.0 or simply provide different defaults if we detect the user specifies Spark 2 (e.g. Scala 2.11.8 instead of Scala 2.10.x). I've verified that all ML tests now function as expected.

This currently relies on having the jars from a recently built Spark 2 in the lib folder for all spark-perf projects, because the APIs have changed since the spark-2.0.0-preview artifact in Maven Central; the requirement will be removed once spark-2.0.0 artifacts are available.

I'd appreciate having this reviewed; you can easily view my changes at master...a-roberts:master

@a-roberts

We've noticed a 30% geomean regression for Spark 2 with this SparkPerf versus Spark 1.5.2 with "normal" SparkPerf (i.e. before this changeset). This is running with a low scale factor and the configuration below.

Either my changes are a real disaster or we've hit a significant performance regression. We can gather a 1.6.2 comparison, but I'd like my changes to the benchmark itself to be checked first so we can rule out problems there.

@pwendell, as a top contributor to this project, can you or anybody else familiar with the new Spark 2 APIs please review this changeset?

Configuration used where we see the big regression:

  1. spark-perf/config/config.py: SCALE_FACTOR=0.05
     No. of Workers: 1
     Executors per Worker: 1
     Executor Memory: 18G
     Driver Memory: 8G
     Serializer: kryo
  2. $SPARK_HOME/conf/spark-defaults.conf: executor Java options: -Xdisableexplicitgc -Xcompressedrefs

Main changes I made (a sketch of a few of the API-level changes follows the list):

  • Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
  • MLAlgorithmTests use Vectors.fromML
  • For streaming-tests HdfsRecoveryTest we use wordStream.foreachRDD not wordStream.foreach
  • KVDataTest uses awaitTerminationOrTimeout in a SparkStreamingContext instead of awaitTermination
  • Trivial: we use compact not compact.render for outputting json
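
To illustrate a few of these, the API-level changes look roughly like the following (a sketch only; identifiers such as wordStream, outputDir, and ssc are placeholders rather than the exact names used in the test sources):

import org.apache.spark.{ml, mllib}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Sketch: Spark 2 drops DStream.foreach, so HdfsRecoveryTest-style code moves to foreachRDD.
def persistCounts(wordStream: DStream[(String, Int)], outputDir: String): Unit = {
  wordStream.foreachRDD { rdd =>
    rdd.saveAsTextFile(outputDir + "/counts-" + System.currentTimeMillis())
  }
}

// Sketch: wait with a timeout rather than blocking indefinitely.
def runFor(ssc: StreamingContext, millis: Long): Unit = {
  ssc.start()
  ssc.awaitTerminationOrTimeout(millis) // replaces awaitTermination
  ssc.stop(stopSparkContext = false)
}

// Sketch: convert the new ml.linalg vectors back to mllib.linalg vectors for the MLlib tests.
def toOldVector(v: ml.linalg.Vector): mllib.linalg.Vector =
  mllib.linalg.Vectors.fromML(v)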

In Spark 2.0 the top five methods where we spend our time are as follows; the percentage is how much of the overall processing time was spent in that particular method:

  1. AppendOnlyMap.changeValue 44%
  2. SortShuffleWriter.write 19%
  3. SizeTracker.estimateSize 7.5%
  4. SizeEstimator.estimate 5.36%
  5. Range.foreach 3.6%

and in 1.5.2 the top five methods are:

  1. AppendOnlyMap.changeValue 38%
  2. ExternalSorter.insertAll 33%
  3. Range.foreach 4%
  4. SizeEstimator.estimate 2%
  5. SizeEstimator.visitSingleObject 2%

I see the following scores (test name, then the 1.5.2 time, then the 2.0.0 time):

| Test | Spark 1.5.2 | Spark 2.0.0 |
| --- | --- | --- |
| sthroughput | 5.2s | 7.08s |
| agg by key | 0.72s | 1.01s |
| agg by key int | 0.93s | 1.19s |
| agg by key naive | 1.88s | 2.02s |
| sort by key | 0.64s | 0.8s |
| sort by key int | 0.59s | 0.64s |
| scala count | 0.09s | 0.08s |
| scala count w fltr | 0.31s | 0.47s |

This is only running the Spark core tests (scheduling throughput through scala-count-w-filtr, including all in between).

I'll mention this on the mailing list as part of a general performance regression thread so this particular item remains focused on the Spark 2.0.0 changes I have made for SparkPerf; the goal is to have something stable to compare Spark releases with.

@a-roberts

I'm updating this to work with Spark 2 now that it's available, so we don't need to use a snapshot or build against an included version.

@somideshmukh

So now we need to clone and build the new spark-perf to work with Spark 2.0? And which modules of spark-perf will work with Spark 2.0?

@a-roberts

All modules, my PR is at #115

@somideshmukh

But when I go to https://github.com/databricks/spark-perf.git and clone master, I don't find any commit for 2.0.

@a-roberts

a-roberts commented Aug 26, 2016

That's because my change is a pull request that hasn't been merged. I'm now working on a small issue regarding the Spark version in the mllib project, as I see the Travis CI integration build failed. It would be much appreciated if you could clone my changes and see if you find any problems.

@somideshmukh

Hi, I have cloned your changes, integrated them with Spark 2.0, and run the Spark tests. I got proper results with no errors. The only change I needed to make was in the config.py file: in place of MLLIB_SPARK_VERSION = 2.0.0 I needed to keep MLLIB_SPARK_VERSION = 2.0.

@Minkyolyy

Where can I clone your changes?

@saksgarg

Any update on the issues in this project?

@Minkyolyy

Maybe spark-perf 2.0 just replaces some of the packages and doesn't take advantage of the Dataset API.
