
Push aggregations down to Spark #117

Open
jeremyrsmith opened this issue Oct 18, 2017 · 1 comment

Comments

@jeremyrsmith
Contributor

When using withDataFrame, Vegas collects all the data on the driver, falling back to sampling once a row-count threshold is exceeded.

But when your plot does aggregations, this means Vegas fetches all the data to the driver – possibly sampling it – and pushes all of it to vega-lite, where the aggregation happens in JavaScript in the browser. This is probably never what you want.

It would be entirely possible to map AggOps to Spark aggregations and push the aggregation itself down to Spark. This would dramatically reduce the cardinality of the data and would probably eliminate the need to sample in most cases.
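As a sketch of that mapping (the op names follow vega-lite's aggregate vocabulary, and the helper name `sparkAggExpr` is hypothetical, not Vegas's actual API), each aggregation op could be translated into a Spark SQL expression string and handed to `DataFrame.agg(expr(...))`:

```scala
// Hypothetical translation from vega-lite aggregate op names to Spark SQL
// expression strings, suitable for df.groupBy(...).agg(expr(...)).
// This is a sketch of the push-down idea, not the Vegas implementation.
def sparkAggExpr(op: String, field: String): String = op match {
  case "mean" | "average" => s"avg(`$field`)"
  case "sum"              => s"sum(`$field`)"
  case "min"              => s"min(`$field`)"
  case "max"              => s"max(`$field`)"
  case "count"            => "count(*)"
  case "distinct"         => s"count(DISTINCT `$field`)"
  case other              => sys.error(s"unsupported aggregate op: $other")
}
```

With something like this, a plot that averages `price` per `category` could run `df.groupBy("category").agg(expr(sparkAggExpr("mean", "price")))` and collect only one row per category instead of the whole dataset.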

@oshikiri
Collaborator

Thanks, @jeremyrsmith

> This is probably never what you want.

I agree. I think the default behaviour should be changed: it would be better to pass all the data to vega-lite by default.

```scala
val DefaultLimit = 10000

implicit class VegasSpark[T](val specBuilder: DataDSL[T]) {
  def withDataFrame(df: DataFrame, limit: Int = DefaultLimit): T = {
    val columns: Array[String] = df.columns
    val count: Double = df.count
    val data = {
      if (count >= limit) df.sample(false, limit / count).collect() else df.collect()
    }.map { row =>
      // … (snippet truncated in the original comment)
```