
Push aggregations down to Spark #117

Open
jeremyrsmith opened this issue Oct 18, 2017 · 1 comment

Comments

@jeremyrsmith
Contributor

When using withDataFrame, Vegas collects all the data on the driver, falling back to sampling once a row-count threshold is exceeded.

But when your plot does aggregations, this means Vegas fetches all the data to the driver – possibly sampling it – and pushes all of it to vega-lite, where the aggregation happens in JavaScript in the browser. This is probably never what you want.

It would be entirely possible to map AggOps to Spark aggregations and push the aggregation itself down to Spark. This would dramatically reduce the cardinality of the data and would probably eliminate the need to sample in most cases.
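As a sketch of that mapping (the op names follow vega-lite's aggregate vocabulary, and the helper name `sparkAggExpr` is hypothetical, not Vegas's actual API), each aggregation op could be translated into a Spark SQL expression string and handed to `DataFrame.agg(expr(...))`:

```scala
// Hypothetical translation from vega-lite aggregate op names to Spark SQL
// expression strings, suitable for df.groupBy(...).agg(expr(...)).
// This is a sketch of the push-down idea, not the Vegas implementation.
def sparkAggExpr(op: String, field: String): String = op match {
  case "mean" | "average" => s"avg(`$field`)"
  case "sum"              => s"sum(`$field`)"
  case "min"              => s"min(`$field`)"
  case "max"              => s"max(`$field`)"
  case "count"            => "count(*)"
  case "distinct"         => s"count(DISTINCT `$field`)"
  case other              => sys.error(s"unsupported aggregate op: $other")
}
```

With something like this, a plot that averages `price` per `category` could run `df.groupBy("category").agg(expr(sparkAggExpr("mean", "price")))` and collect only one row per category instead of the whole dataset.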

@oshikiri
Collaborator

Thanks, @jeremyrsmith

> This is probably never what you want.

I agree. I think the default behaviour should be changed: it would be better to pass all the data to vega-lite by default.

```scala
val DefaultLimit = 10000

implicit class VegasSpark[T](val specBuilder: DataDSL[T]) {
  def withDataFrame(df: DataFrame, limit: Int = DefaultLimit): T = {
    val columns: Array[String] = df.columns
    val count: Double = df.count
    val data = {
      if (count >= limit) df.sample(false, limit / count).collect() else df.collect()
    }.map { row =>
      // … (snippet truncated in the original comment)
```