Apache Spark support in pipelinedp4j for end-to-end Differential Privacy #278

sakkumar · 2024-11-28T22:01:18Z

Apache Spark support in pipelinedp4j for end-to-end Differential Privacy

Testing
% cd pipelinedp4j
% bazelisk build ...
% bazelisk test ...

% cd examples/pipelinedp4j
% bazelisk build ...

SparkExample:
bazel-bin/SparkExample --local-input-file-path="./netflix_data.csv" --local-output-file-path="./output/"

Starting calculations...
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
24/11/27 15:55:43 INFO SparkContext: Running Spark version 3.3.2
....
....
24/11/27 15:55:48 INFO SparkContext: Successfully stopped SparkContext
Finished calculations.

pipelinedp4j % cat ./output/part-00000-867b9339-e539-47b5-a9fd-d7140d441a1f-c000.txt 

movieId=4506, numberOfViewers=6854, numberOfViews=6841, averageOfRatings=3.9186866446697803
movieId=4505, numberOfViewers=234, numberOfViews=235, averageOfRatings=2.509885102993678
movieId=4503, numberOfViewers=1770, numberOfViews=1802, averageOfRatings=3.20902639675554
movieId=4500, numberOfViewers=257, numberOfViews=242, averageOfRatings=3.0546646050432593
movieId=4501, numberOfViewers=578, numberOfViews=594, averageOfRatings=3.107717355258994
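
The output lines above use a plain `key=value, key=value` layout, so they are easy to load for a quick sanity check. The following is a minimal sketch, assuming every line in the part file follows that layout; the helper name `parse_result_line` is hypothetical and not part of pipelinedp4j.

```python
def parse_result_line(line: str) -> dict:
    """Parse one 'key=value, key=value' output line into a dict of numbers.

    Hypothetical helper for inspecting SparkExample output; not part of pipelinedp4j.
    """
    record = {}
    for field in line.strip().split(", "):
        key, _, value = field.partition("=")
        try:
            # Counts and ids in the sample output are integers.
            record[key] = int(value)
        except ValueError:
            # Anything else (e.g. averageOfRatings) is read as a float.
            record[key] = float(value)
    return record


# One line copied from the sample output above.
sample = "movieId=4505, numberOfViewers=234, numberOfViews=235, averageOfRatings=2.509885102993678"
row = parse_result_line(sample)
print(row["movieId"], row["numberOfViews"], row["averageOfRatings"])
```

Since the values are differentially private, columns need not be exactly consistent with each other (movieId=4506 above reports fewer views than viewers, presumably because of the added noise), so any check built on this should compare values only approximately.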

RamSaw

Thank you very much, Saket, for your contribution!

RamSaw

I realized that we didn't run the Google formatter. Could you apply it? I created two patches; please apply them sequentially (Format_files.patch first, then Format_files_2.patch).

Format_files.patch
Format_files_2.patch

RamSaw · 2024-12-02T13:23:06Z

I used https://github.com/google/google-java-format for Java and https://github.com/facebook/ktfmt for Kotlin. I will add them to GitHub Actions, and maybe also add an easy way to install them and apply them to the code right before committing changes.

sakkumar · 2024-12-02T17:29:33Z

> I realized that we didn't run the Google formatter. Could you apply it? I created two patches; please apply them sequentially (Format_files.patch first, then Format_files_2.patch).
>
> Format_files.patch Format_files_2.patch

Done

sakkumar added 30 commits November 10, 2024 01:03

Spark kotlin setup

4401ec8

jackson databind dependency issue

a1d164f

fix jacksondata bind version issue

78dee66

Spark Encoders

45345d3

Add Spark Table implementation

fd08491

spark table encoder

70c0af3

adding unit test for spark table

8e36964

Add more unit tests for SparkTable

86a2ec8

spark encoder pair runtime exception

c19df1b

Use scala 2.13

62900f0

Add more unit test for spark collection

90c1862

remove spaces

2be844f

Pair<T1, T2> Encoder for Spark

82b0e75

resolve PR comments

5e7ef96

gitignore /.ijwb/.idea/ files

827ca27

gitignore /.ijwb/.idea files

b9e6282

gitignore /.ijwb/.idea files

d25d324

gitignore ijwb files

9ef6d38

create class rule to create spark session for each test class run

e5cf4d1

Add implementation of samplePerKey for SparkTable

af9a64a

Add QueryBuilder for Spark

bb8ff4b

resolve comments

d51ab2f

Spark QueryBuilder implementation

e6cd0f1

Remove comments

488cfff

add copyright comment

bcff159

rename variables

cd91c0a

Add copyright for files

d537548

Added comment and renamed variables in filterKeysStoredInSparkCollection

5b0649a

Added formatting changes for new liner

809f524

SparkExample for end-to-end Differential Privacy

6b48b47

sakkumar added 7 commits November 27, 2024 16:05

Remove println

d62c6f5

resolve comments

f3cda40

resolve comments

b87c7fc

spark kryo serializer requires class to be public

5f7eda2

resolve comments

60abc3d

Merge with origin

33426db

Correct Java comment

f95557d

RamSaw assigned RamSaw and unassigned RamSaw Dec 2, 2024

RamSaw self-requested a review December 2, 2024 12:04

RamSaw assigned sakkumar Dec 2, 2024

RamSaw approved these changes Dec 2, 2024

RamSaw requested changes Dec 2, 2024

Format files

26ea30b

RamSaw approved these changes Dec 2, 2024

RamSaw merged commit 1dfe8f9 into google:main Dec 5, 2024
8 checks passed
