From a9d836de92f413190df9b854e225ad80f6a4139f Mon Sep 17 00:00:00 2001 From: stanbrub Date: Tue, 3 Dec 2024 21:10:50 -0700 Subject: [PATCH 01/12] First pass at adhoc documentation --- README.md | 44 +------- docs/AdhocWorkflows.md | 71 +++++++++++++ docs/GithubSecrets.md | 217 ++++++++++++++++++++++++++++++++++++++++ docs/TestingConcepts.md | 29 ++++++ 4 files changed, 320 insertions(+), 41 deletions(-) create mode 100644 docs/AdhocWorkflows.md create mode 100644 docs/GithubSecrets.md create mode 100644 docs/TestingConcepts.md diff --git a/README.md b/README.md index 14aa6e4b..6c29448a 100644 --- a/README.md +++ b/README.md @@ -26,52 +26,14 @@ from the engine(s) to reduce the affect of I/O and test setup on the results. Resources: - [Getting Started](docs/GettingStarted.md) - Getting set up to run benchmarks against Deephaven Community Core +- [Testing Concepts](docs/TestingConcepts.md) - Understanding what drives Benchmark development - [Test-writing Basics](docs/TestWritingBasics.md) - How to generate data and use it for tests - [Collected Results](docs/CollectedResults.md) - What's in the benchmark results - [Running the Release Distribution](docs/distro/BenchmarkDistribution.md) - How to run Deephaven benchmarks from a release tar file - [Running from the Command Line](docs/CommandLine.md) - How to run the benchmark jar with a test package +- [Running Adhoc Github Workflows](docs/AdhocWorkflows.md) - Running benchmark sets on-demand from Github - [Published Results Storage](docs/PublishedResults.md) - How to grab and use Deephaven's published benchmarks - -## Concepts - -### Self-guided API -The *Bench* API uses the builder pattern to guide the test writer in generating data, executing queries, and fetching results. There is a single API -entry point where a user can follow the dots and look at the code-insight and Javadocs that pop up in the IDE. Default properties -can be overriden by builder-style "with" methods like *withRowCount()*. A middle ground is taken between text configuration and configuration -fully-expressed in code to keep things simple and readable. - -### Scale Rather Than Iterations -Repeating tests can be useful for testing the effects of caching (e.g. load file multiple times; is it faster on subsequent loads?), or overcoming a lack of -precision in OS timers (e.g. run a fast function many times and average), or average out variability between runs (there are always anomalies). On the other hand, -if the context of the test is processing large data sets, then it's better to measure against large data sets where possible. This provides a benchmark test -that's closer to the real thing when it comes to memory consumption, garbage collection, thread usage, and JIT optimizations. Repeating tests, though useful in -some scenarios, can have the effect of taking the operation under test out of the benchmark equation because of cached results, resets for each iteration, -limited heap usage, or smaller data sets that are too uniform. - -### Adjust Scale For Each Test -When measuring a full set of benchmarks for transforming data, some benchmarks will naturally be faster than others (e.g. sums vs joins). Running all benchmarks -at the same scale (e.g. 10 million rows) could yield results where one benchmark takes a minute and another takes 100 milliseconds. Is the 100 ms test -meaningful, especially when measured in a JVM? Not really, because there is no time to assess the impact of JVM ergonomics or the effect of OS background -tasks. 
A better way is to set scale multipliers to amplify row count for tests that need it. - -### Test-centric Design -Want to know what tables and operations the test uses? Go to the test. Want to know what the framework is doing behind the scenes? Step through the test. -Want to run one or more tests? Start from the test rather than configuring an external tool and deploying to that. Let the framework handle the hard part. -The point is that a benchmark test against a remote server should be as easy and clear to write as a unit test. As far as is possible, data generation -should be defined in the same place it's used... in the test. - -### Running in Multiple Contexts -Tests are developed by test-writers, so why not make it easy for them? Run tests from the IDE for ease of debugging. Point the tests to a local or a remote -Deephaven Server instance. Or package tests in a jar and run them locally or remotely from the Benchmark uber-jar. The same tests should work whether -running everything on the same system or different system. - -### Measure Where It Matters -The Benchmark framework allows the test-writer to set each benchmark measurement from the test code instead of relying on a mechanism that measures -automatically behind the scenes. Measurements can be taken across the execution of the test locally with a *Timer* like in the -[JoinTablesFromKafkaStreamTest](src/it/java/io/deephaven/benchmark/tests/internal/examples/stream/JoinTablesFromKafkaStreamTest.java) example test -or fetched from the remote Deephaven instance where the test is running as is done in the -[StandardTestRunner](src/it/java/io/deephaven/benchmark/tests/standard/StandardTestRunner.java) -used for nightly Deephaven benchmarks. Either way the submission of the result to the Benchmark framework is under the test-writer's control. +- [Sssh Secrets](docs/GithubSecrets.md) - How to define Github secrets for running Benchmark workflows in a fork ## Other Deephaven Summaries diff --git a/docs/AdhocWorkflows.md b/docs/AdhocWorkflows.md new file mode 100644 index 00000000..b6906d9c --- /dev/null +++ b/docs/AdhocWorkflows.md @@ -0,0 +1,71 @@ +# Running Adhoc Workflows + +In addition to the benchmarks that are run nightly and after every release, developers can run adhoc benchmarks. These benchmarks can be configured to run small sets of standard benchmarks on-demand. This is useful for more targeted comparisons between multiple sets of [Deephaven Community Core](https://deephaven.io/community/) versions or configuration options. + +A common practice is to run a comparison from a source branch that is ready for review to the main branch for a subset of relevant benchmarks (e.g. Parquet). This allows developers to validate the performance impact of code changes before they are merged. Other possibilities include comparing JVM options for the same DHC version, comparing data distributions (e.g. ascending, descending), and comparing levels of data scale. + +All results are stored according to the initiating user and a user-supplied label in the public [Benchmarking GCloud bucket](https://console.cloud.google.com/storage/browser/deephaven-benchmark). Though the results are available through public URLs, Google Cloud browsing is not. Retrieval of the generated data is mainly the domain of the Adhoc Dashboard. 
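+
+Because the stored files are world-readable, a result CSV can also be pulled straight from its public URL for one-off checks outside the dashboard. The snippet below is only a sketch: the object path is a placeholder built from the `adhoc/<github user>/<set label>` naming convention these workflows use, and the actual file names under a set label are whatever the run uploaded (the Adhoc Dashboard described at the end of this guide resolves them for you).
+```
+from urllib.request import urlopen
+
+# Placeholder path - substitute a path from one of your own runs
+url = 'https://storage.googleapis.com/deephaven-benchmark/adhoc/mygithubuser/mysetlabel/benchmark-results.csv'
+with urlopen(url) as r:
+    print(r.read().decode())
+```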
+ +Prerequisites: +- Permission to use Deephaven's Bare Metal servers and [Github Secrets](./GithubSecrets.md) +- An installation of a [Deephaven Community Core w/ Python docker image](https://deephaven.io/core/docs/getting-started/docker-install/) (0.36.1+) +- The Adhoc Dashboard python snippet shown in this guide + +### Common Workflow UI Field + +The ui fields used for both Adhoc workflows that are common are defined below: +- Use workflow from + - Select the branch where the desired benchmarks are. This is typically "main" but could be a branch in a fork +- Deephaven Image or Core Branch + - The [Deephaven Core](https://github.com/deephaven/deephaven-core) branch, commit hash, tag, or docker image/sha + - ex. Branch: `deephaven:main or myuser:mybranch` + - ex. Commit: `efad062e5488db50221647b63bd9b38e2eb2dc5a` + - ex. Tag: `v0.37.0` + - ex. Docker Image: `0.37.0` + - ex. Docker Sha: `edge@sha256:bba0344347063baff39c1b5c975573fb9773190458d878bea58dfab041e09976` +- Benchmark Test Classes + - Wildcard names of available test classes. For example, `Avg*` will match the AvgByTest + - Because of the nature of the benchmark runner, there is no way to select individual tests by name + - Test classes can be found under in the [standard test directory](https://github.com/deephaven/benchmark/tree/main/src/it/java/io/deephaven/benchmark/tests/standard) +- Benchmark Iterations + - The number of iterations to run for each benchmark. Be careful, large numbers may take hours or days + - Given that the Adhoc Dashboard uses medians, any even numbers entered here will be incremented +- Benchmark Scale Row Count + - The number of millions of rows for the base row count + - All standard benchmarks are scaled using this number. The default is 10 +- Benchmark Data Distribution + - The distribution the data is generated to follow for each column's successive values + - random: random symmetrical data distributed around and including 0 (e.g. -4, -8, 0, 1, 5) + - ascending: positive numbers that increase (e.g. 1, 2, 3, 4, 5) + - descending: negative numbers that decrease (e.g. -1, -2, -3, -4, -5) + - runlength: numbers that repeat (e.g. 1, 1, 1, 2, 2, 2, 3, 3, 3) + +### Adhoc Benchmarks (Auto-provisioned Server) + +The auto-provisioned adhoc workflow allows developers to run workflows on bare metal server hardware that is provisioned on the fly for the benchmark run. It requires two branches, tags, commit hashes, or docker images/shas to run for the same benchmark set. This is the workflow most commonly used to compare performance between a Deephaven PR branch and the main branch. + +Workflow fields not shared with the Existing Server workflow: +- Set Label Prefix + - The prefix used to make the Set Label for each side of the benchmark comparison + - ex. Setting `myprefix` with the images `0.36.0` and `0.37.0` for Deephaven Image or Core Branch cause two directories in the GCloud benchmark bucket + - `adhoc/githubuser/myprefix_0_36_0` and `adhoc/githubuser/myprefix_0_37_0` + - Because of naming rules, non-alpha-nums will be replaced with underscores + +### Adhoc Benchmarks (Existing Server) + +The adhoc workflow that uses an existing server allows developers more freedom to experiment with JVM options. It also gives them more freedom to shoot themselves in the foot. For example, if max heap is set bigger than the test server memory, the Deephaven service will crash. 
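+
+A quick way to confirm what the engine actually received is to ask the running JVM for its max heap from a DHC console once the server is up. This is only a sketch; it assumes the `jpy` Java bridge that ships with Deephaven's Python distribution.
+```
+# Report the running Deephaven JVM's max heap (run in a DHC code studio)
+import jpy
+
+Runtime = jpy.get_type('java.lang.Runtime')
+max_heap_gib = Runtime.getRuntime().maxMemory() / (1024 ** 3)
+print(f'Engine max heap: {max_heap_gib:.1f} GiB')
+```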
+ +Workflow fields not shared with the Auto-provisioned Server workflow: +- Deephaven JVM Options + - Options that will be included as JVM arguments to the Deephaven service + - ex. `-Xmx24g -DQueryTable.memoizeResults=true` +- Set Label + - The label to used to store the result in the GCloud benchmark bucket + - ex. Setting `mysetlabel` would be stored at `adhoc/mygithubuser/mysetlabel` +- Benchmark Test Package + - The java package where the desired benchmark test classes are + - Unless making custom tests in a fork, use the default + +# The Adhoc Dashboard + + diff --git a/docs/GithubSecrets.md b/docs/GithubSecrets.md new file mode 100644 index 00000000..b241c189 --- /dev/null +++ b/docs/GithubSecrets.md @@ -0,0 +1,217 @@ +# Collected Results + +Running Benchmark tests in an IDE produces results in a directory structure in the current (working) directory. Running the same tests +from the command line through the deephaven-benchmark uber-jar yields the same directory structure for each run while accumulating the +data from repeated runs instead of overwriting it. + +### Example IDE-driven Directory Structure +```` +results/ + benchmark-metrics.csv + benchmark-platform.csv + benchmark-results.csv + test-logs/ + io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromKafkaStream.query.md + io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromParquetAndStream.query.md +```` + +### Example Command-line-driven Directory Structure +```` +results/ + run-17d06a7611 + benchmark-metrics.csv + benchmark-platform.csv + benchmark-results.csv + test-logs/ + io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromKafkaStream.query.md + io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromParquetAndStream.query.md + run-17d9ec2f2e + benchmark-metrics.csv + benchmark-platform.csv + benchmark-results.csv + test-logs/ + io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromKafkaStream.query.md + io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromParquetAndStream.query.md +```` + +What does each file mean? +- run\-\: Time-based unique id for the batch of tests that was run +- benchmark-metrics.csv: MXBean and other metrics collected over the test run +- benchmark-platform.csv: Various VM and hardware details for the components of the test system +- benchmark-results.csv: Query rates for the running tests at scale +- test-logs: Directory containing details about each test run according to test class +- \*.query.md: A log showing the queries that where executed to complete each test in the order they were executed + +## Benchmark Platform CSV + +The benchmark-platform csv contains details for the system/VM where the test runner and Deephave Engine ran. +Details include Available Processors, JVM version, OS version, etc. 
+ +Columns defined in the file are: +- origin: The service where the property was collected +- name: The property name +- value: The property value for the service + +Properties defined in the file are: +- java.version: The version of java running the application +- java.vm.name: The name of the java virtual machine running the application +- java.class.version: The class version the java VM supports +- os.name: The name of the operating system hosting the application +- os.version: The version of the operating system hosting the application +- available.processors: The number of CPUs the application is allowed to use +- java.max.memory: Maximum amount of memory the application is allowed to use +- python.version: The version of python used in the Deephaven Engine +- deephaven.version: The version of Deephaven tested against (client and server may be different) + + +### Example benchmark-platform.csv +```` +origin, name, value +test-runner, java.version, 17.0.1 +test-runner, java.vm.name, OpenJDK 64-Bit Server VM +test-runner, java.class.version, 61 +test-runner, os.name, Windows 10 +test-runner, os.version, 10 +test-runner, available.processors, 16 +test-runner, java.max.memory, 15.98G +test-runner, deephaven.version, 0.22.0 +deephaven-engine, java.version, 17.0.5 +deephaven-engine, java.vm.name, OpenJDK 64-Bit Server VM +deephaven-engine, java.class.version, 61 +deephaven-engine, os.name, Linux +deephaven-engine, os.version, 5.15.79.1-microsoft-standard-WSL2 +deephaven-engine, available.processors, 12 +deephaven-engine, java.max.memory, 42.00G +deephaven-engine, python.version, 3.10.6 +```` + +## Benchmark Results CSV + +The benchmark-results.csv contains measurements taken of the course of each test run. One row is listed for each test. + +Fields supplied in the file are: +- benchmark_name: The unique name of the benchmark +- origin: The serice where the benchmark was collected +- timestamp: Millis since epoch at the beginning of the benchmark +- test_duration: Seconds elapsed for the entire test run including setup and teardown +- op_duration: Seconds elapsed for the operation under measurement +- op_rate: Processing rate supplied by the test-writer +- row_count: The number of rows processed by the operation + +### Example benchmark-results.csv +```` +benchmark_name,origin,timestamp,test_duration,op_duration,op_rate,row_count +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926395799,155.4960,5.4450,367309458,2000000000 +Select- 1 Calc Using 2 Cols -Inc,deephaven-engine,1683926551491,12.1970,3.5900,111420612,400000000 +Select- 2 Cals Using 2 Cols -Static,n/a,1683926563714,8.7480,8.7480,1143118,10000000 +SelectDistinct- 1 Group 250 Unique Vals -Static,deephaven-engine,1683926572487,195.0420,18.5530,64679566,1200000000 +```` + +## Benchmark Metrics CSV + +The benchmark-metrics.csv contains metrics collected while running the benchmark. Most metrics (like MXBean metrics) represent a snapshot +at a moment in time. When these snapshots are taken and collected are up to the test-writer. For example, in the standard benchmarks +available in this project, metrics snaphosts are taken before and after the benchmark operation. The before-after metrics can be compared +to calculate things like Heap Gain or garbage collection counts. 
+ +Field supplied in the file are: +- benchmark_name: The unique name of the benchmark +- origin: The serice where the metric was collected +- timestamp: Millis since epoch when the metrics was recorded +- category: A grouping category for the metric +- type: The type of metric has been collected (should be more narrowly focused than category) +- name: A metric name that is unique within the category +- value: The numeric value of the metric +- note: Any addition clarifying information + +## Example benchmark-metrics.csv +```` +benchmark_name,origin,timestamp,category,type,name,value,note +Select- 1 Calc Using 2 Cols -Static,test-runner,1683926402500,setup,docker,restart,6700,standard +Select- 1 Calc Using 2 Cols -Static,test-runner,1683926479998,source,generator,duration.secs,74.509, +Select- 1 Calc Using 2 Cols -Static,test-runner,1683926479998,source,generator,record.count,50000000, +Select- 1 Calc Using 2 Cols -Static,test-runner,1683926479998,source,generator,send.rate,671059.8719617764, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,ClassLoadingImpl,ClassLoading,TotalLoadedClassCount,13172, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,ClassLoadingImpl,ClassLoading,UnloadedClassCount,16, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,ObjectPendingFinalizationCount,0, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,HeapMemoryUsage Committed,9126805504, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,HeapMemoryUsage Init,1157627904, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,HeapMemoryUsage Max,25769803776, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,HeapMemoryUsage Used,2963455008, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,NonHeapMemoryUsage Committed,106430464, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,NonHeapMemoryUsage Init,7667712, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,NonHeapMemoryUsage Max,-1, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,NonHeapMemoryUsage Used,101707208, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,HotSpotThreadImpl,Threading,ThreadCount,50, +Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,HotSpotThreadImpl,Threading,PeakThreadCount,50, +```` + +## Query Log + +Query logs record queries in the order in which they were run during a test. These include queries run by the framework automatically behind the scenes. +Any property variables supplied in a query are replaced with the actual values used during the test run. After a test is run, it is usually possible to +copy and paste the query into the Deephaven UI and run it, because parquet and kafka topic data is left intact. However, if Test B is run after Test A, +running the recorded queries for Test A may not work, since cleanup is done automatically between tests. +The log is in Markdown format for easier viewing. 
+ +### Example Query Log +~~~~ +# Test Class - io.deephaven.benchmark.tests.internal.examples.stream.JoinTablesFromKafkaStreamTest + +## Test - Count Records From Kakfa Stream + +### Query 1 +```` +from deephaven import kafka_consumer as kc +from deephaven.stream.kafka.consumer import TableType, KeyValueSpec + +def bench_api_kafka_consume(topic: str, table_type: str): + t_type = None + if table_type == 'append': t_type = TableType.append() + elif table_type == 'blink': t_type = TableType.blink() + elif table_type == 'ring': t_type = TableType.ring() + else: raise Exception('Unsupported kafka stream type: {}'.format(t_type)) + + return kc.consume( + { 'bootstrap.servers' : 'redpanda:29092', 'schema.registry.url' : 'http://redpanda:8081' }, + topic, partitions=None, offsets=kc.ALL_PARTITIONS_SEEK_TO_BEGINNING, + key_spec=KeyValueSpec.IGNORE, value_spec=kc.avro_spec(topic + '_record', schema_version='1'), + table_type=t_type) + +from deephaven.table import Table +from deephaven.ugp import exclusive_lock + +def bench_api_await_table_size(table: Table, row_count: int): + with exclusive_lock(): + while table.j_table.size() < row_count: + table.j_table.awaitUpdate() + +```` + +### Query 2 +```` +kafka_stock_trans = bench_api_kafka_consume('stock_trans', 'append') +bench_api_await_table_size(kafka_stock_trans, 100000) +```` + +### Query 3 +```` +kafka_stock_trans=None +from deephaven import garbage_collect; garbage_collect() +```` +~~~~ + +The above log is an example of what you would see in *test‑logs/io.deephaven.benchmark.tests.internal.examples.stream.JoinTablesFromKafkaStreamTest.query.md* +when running the *JoinTablesFromKafkaStreamTest.countRecordsFromKafkaStream()* integration test. + +The log has several components: +- Test Class: The fully-qualified class name of the test +- Test: The description of the test the test-writer supplied +- Query 1,2,3: The Deephaven queries executed in Deephaven in the order they were executed + +What the queries are doing: +- Query 1: The definitions of some the Bench API convenience functions used in Query 2. The test-writer used *bench_api_kafka_consume()* and +*bench_api_await_table_size()* in the query, and the Bench API automatically added the corresponding function definitions +- Query 2: The part that the test-writer wrote for the test (see *JoinTablesFromKafkaStreamTest.countRecordsFromKafkaStream()*) +- Query 3: The cleanup query added automatically by the Bench API + diff --git a/docs/TestingConcepts.md b/docs/TestingConcepts.md new file mode 100644 index 00000000..cbb8d0e3 --- /dev/null +++ b/docs/TestingConcepts.md @@ -0,0 +1,29 @@ +# Benchmarking Concepts + +Benchmark is designed to work easily in different contexts like within and IDE, from the command line, and as part of Github workflows. However, there is more to writing benchmarks than getting a test working and measuring between two points. Is it reproducible? Does it perform similarly at different scales? What about from day to day? Is setup included in the measurement? Is it easy to add to and maintain? + +What follows are some concepts that guide the development of Benchmark meant to kept it versatile, simple, and relevant. + +### Self-guided API +The *Bench* API uses the builder pattern to guide the test writer in generating data, executing queries, and fetching results. There is a single API entry point where a user can follow the dots and look at the code-insight and Javadocs that pop up in the IDE. 
Default properties can be overriden by builder-style "with" methods like *withRowCount()*. A middle ground is taken between text configuration and configuration fully-expressed in code to keep things simple and readable. + +### Scale Rather Than Iterations +Repeating tests can be useful for testing the effects of caching (e.g. load file multiple times; is it faster on subsequent loads?), or overcoming a lack of precision in OS timers (e.g. run a fast function many times and average), or average out variability between runs (there are always anomalies). On the other hand, if the context of the test is processing large data sets, then it's better to measure against large data sets where possible. This provides a benchmark test that's closer to the real thing when it comes to memory consumption, garbage collection, thread usage, and JIT optimizations. Repeating tests, though useful in some scenarios, can have the effect of taking the operation under test out of the benchmark equation because of cached results, resets for each iteration, +limited heap usage, or smaller data sets that are too uniform. + +### Adjust Scale For Each Test +When measuring a full set of benchmarks for transforming data, some benchmarks will naturally be faster than others (e.g. sums vs joins). Running all benchmarks at the same scale (e.g. 10 million rows) could yield results where one benchmark takes a minute and another takes 100 milliseconds. Is the 100 ms test meaningful, especially when measured in a JVM? Not really, because there is no time to assess the impact of JVM ergonomics or the effect of OS background tasks. A better way is to set scale multipliers to amplify row count for tests that need it. + +### Test-centric Design +Want to know what tables and operations the test uses? Go to the test. Want to know what the framework is doing behind the scenes? Step through the test. Want to run one or more tests? Start from the test rather than configuring an external tool and deploying to that. Let the framework handle the hard part. The point is that a benchmark test against a remote server should be as easy and clear to write as a unit test. As far as is possible, data generation should be defined in the same place it's used... in the test. + +### Running in Multiple Contexts +Tests are developed by test-writers, so why not make it easy for them? Run tests from the IDE for ease of debugging. Point the tests to a local or a remote Deephaven Server instance. Or package tests in a jar and run them locally or remotely from the Benchmark uber-jar. The same tests should work whether running everything on the same system or different system. + +### Measure Where It Matters +The Benchmark framework allows the test-writer to set each benchmark measurement from the test code instead of relying on a mechanism that measures automatically behind the scenes. Measurements can be taken across the execution of the test locally with a *Timer* like in the +[JoinTablesFromKafkaStreamTest](src/it/java/io/deephaven/benchmark/tests/internal/examples/stream/JoinTablesFromKafkaStreamTest.java) example test or fetched from the remote Deephaven instance where the test is running as is done in the [StandardTestRunner](src/it/java/io/deephaven/benchmark/tests/standard/StandardTestRunner.java) used for nightly Deephaven benchmarks. Either way the submission of the result to the Benchmark framework is under the test-writer's control. 
+
+### Preserve and Compare
+Most benchmarking efforts involve a fixed timeline for "improving performance" rather than tracking the performance impacts of code changes day-to-day. This can be effective, but how do you know if future code changes are degrading performance unless benchmarking is done every day. A better way is to preserve benchmarking results every day and compare to the results from the previous week. To avoid death by a thousand cuts, compare release to release as well for the same benchmarks. Also, compare benchmarks for your product with other products for equivalent operations.
+

From f92d83f2f263ffe3d8ee9b4f21811d603c5cdf71 Mon Sep 17 00:00:00 2001
From: stanbrub
Date: Wed, 11 Dec 2024 17:50:27 -0700
Subject: [PATCH 02/12] Added Adhoc Dashboard explanation

---
 docs/AdhocWorkflows.md | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/docs/AdhocWorkflows.md b/docs/AdhocWorkflows.md
index b6906d9c..ce9359f0 100644
--- a/docs/AdhocWorkflows.md
+++ b/docs/AdhocWorkflows.md
@@ -11,7 +11,7 @@ Prerequisites:
 - An installation of a [Deephaven Community Core w/ Python docker image](https://deephaven.io/core/docs/getting-started/docker-install/) (0.36.1+)
 - The Adhoc Dashboard python snippet shown in this guide
 
-### Common Workflow UI Field
+### Common Workflow UI Fields
 
 The ui fields used for both Adhoc workflows that are common are defined below:
 - Use workflow from
@@ -58,7 +58,7 @@ The adhoc workflow that uses an existing server allows developers more freedom t
 Workflow fields not shared with the Auto-provisioned Server workflow:
 - Deephaven JVM Options
   - Options that will be included as JVM arguments to the Deephaven service
-  - ex. `-Xmx24g -DQueryTable.memoizeResults=true`
+  - ex. `-Xmx24g -DQueryTable.memoizeResults=false`
 - Set Label
   - The label to used to store the result in the GCloud benchmark bucket
   - ex. Setting `mysetlabel` would be stored at `adhoc/mygithubuser/mysetlabel`
@@ -66,6 +66,49 @@ Workflow fields not shared with the Auto-provisioned Server workflow:
   - The java package where the desired benchmark test classes are
   - Unless making custom tests in a fork, use the default
 
-# The Adhoc Dashboard
+### The Adhoc Dashboard
 
+The Adhoc Dashboard provides visualization for Benchmark results using Deephaven UI. The typical use case runs a set of benchmarks through the auto-provisioned workflow, and then the two result sets are compared using the dashboard. Results that are displayed include rate comparisons, a rate chart for the runs in each benchmark set, version and platform changes between each set, and basic metrics taken before and after each measured run.
+
+The Adhoc Dashboard is not intended to be bullet proof. It is expected that you know what values to fill in to the dashboard input, and there are no prompts or validation. There are also no "waiting" icons when waiting for data to be downloaded from GCloud.
+
+### Running Adhoc Dashboard
+
+For ease of running on different Deephaven service locations, the snippet below is provided, which downloads and runs the Adhoc Dashboard remotely. Run this snippet in a DHC code studio.
+``` +from urllib.request import urlopen; import os + +root = 'file:///nfs' if os.path.exists('/nfs/deephaven-benchmark') else 'https://storage.googleapis.com' +with urlopen(f'{root}/deephaven-benchmark/adhoc_dashboard.dh.py') as r: + exec(r.read().decode(), globals(), locals()) + storage_uri = f'{root}/deephaven-benchmark' +``` +This script will bring up a Deephaven UI dashboard separated generally into quadrants that are blank. To view results, use the following prescription: +- Input Text Fields + - Actor: Fill in your user (the one that ran the Adhoc workflow) + - Set Label: A portion of the set label (or prefix) supplied during the workflow to match at least one Benchmark Set + - Click the Apply button, and a table should be loaded underneath the input form +- Benchmark Results Table + - `Var_` column: A percentage deviation (variability) for the mean for the Benchmark Set runs + - `Rate_` column: The number of rows per second processed for the benchmark + - `Change_` column: The gain (+/-) of the rate compared to the first rate from the left +- Benchmark Rate Chart + - Click a benchmark row in the table + - A line chart appear showing the runs that make up each set + - Each series represents a Benchmark Set +- Benchmark Metric Charts + - Click a benchmark row in the table + - Dropdowns in the lower left quadrant will be populated with metrics + - Select a metrics of interest + - Compare to another series (e.g. Benchmark Set) or to the line chart on the left +- Dependency Diffs + - In the upper right, there are tabs showing various differences between the DHC versions/bits run + - Python Changes: Differences in module versions, additions, or subtractions + - Jar Changes: Differences in jar library versions, additions, or subtractions + +The most Common Error Occurs when entering Actor and Set Label values that do not match anything +``` +DHError +merge tables operation failed. : RuntimeError: java.lang.IllegalArgumentException: No non-null tables provided to merge +``` +In this case, check the User and Set Labels run during the Adhoc Workflow and try again. From ed6722f9351393fa867186a19a1b17a38c4d6546 Mon Sep 17 00:00:00 2001 From: stanbrub Date: Wed, 11 Dec 2024 17:54:39 -0700 Subject: [PATCH 03/12] Reword --- docs/AdhocWorkflows.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/AdhocWorkflows.md b/docs/AdhocWorkflows.md index ce9359f0..f67c949c 100644 --- a/docs/AdhocWorkflows.md +++ b/docs/AdhocWorkflows.md @@ -111,4 +111,4 @@ The most Common Error Occurs when entering Actor and Set Label values that do no DHError merge tables operation failed. : RuntimeError: java.lang.IllegalArgumentException: No non-null tables provided to merge ``` -In this case, check the User and Set Labels run during the Adhoc Workflow and try again. +In this case, check the Actor/User and Set Labels (or Prefix) values used for the Adhoc Workflow and try again. 
From 190e6be67a2b86573dba932c8b13b21b309ecd22 Mon Sep 17 00:00:00 2001 From: stanbrub Date: Thu, 12 Dec 2024 12:44:47 -0700 Subject: [PATCH 04/12] Cleaned up to not include the beginnings of the Secrets doc --- README.md | 1 - docs/GithubSecrets.md | 217 ------------------------------------------ 2 files changed, 218 deletions(-) delete mode 100644 docs/GithubSecrets.md diff --git a/README.md b/README.md index 6c29448a..f2afa4f2 100644 --- a/README.md +++ b/README.md @@ -33,7 +33,6 @@ Resources: - [Running from the Command Line](docs/CommandLine.md) - How to run the benchmark jar with a test package - [Running Adhoc Github Workflows](docs/AdhocWorkflows.md) - Running benchmark sets on-demand from Github - [Published Results Storage](docs/PublishedResults.md) - How to grab and use Deephaven's published benchmarks -- [Sssh Secrets](docs/GithubSecrets.md) - How to define Github secrets for running Benchmark workflows in a fork ## Other Deephaven Summaries diff --git a/docs/GithubSecrets.md b/docs/GithubSecrets.md deleted file mode 100644 index b241c189..00000000 --- a/docs/GithubSecrets.md +++ /dev/null @@ -1,217 +0,0 @@ -# Collected Results - -Running Benchmark tests in an IDE produces results in a directory structure in the current (working) directory. Running the same tests -from the command line through the deephaven-benchmark uber-jar yields the same directory structure for each run while accumulating the -data from repeated runs instead of overwriting it. - -### Example IDE-driven Directory Structure -```` -results/ - benchmark-metrics.csv - benchmark-platform.csv - benchmark-results.csv - test-logs/ - io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromKafkaStream.query.md - io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromParquetAndStream.query.md -```` - -### Example Command-line-driven Directory Structure -```` -results/ - run-17d06a7611 - benchmark-metrics.csv - benchmark-platform.csv - benchmark-results.csv - test-logs/ - io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromKafkaStream.query.md - io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromParquetAndStream.query.md - run-17d9ec2f2e - benchmark-metrics.csv - benchmark-platform.csv - benchmark-results.csv - test-logs/ - io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromKafkaStream.query.md - io.deephaven.benchmark.tests.query.examples.stream.JoinTablesFromParquetAndStream.query.md -```` - -What does each file mean? -- run\-\: Time-based unique id for the batch of tests that was run -- benchmark-metrics.csv: MXBean and other metrics collected over the test run -- benchmark-platform.csv: Various VM and hardware details for the components of the test system -- benchmark-results.csv: Query rates for the running tests at scale -- test-logs: Directory containing details about each test run according to test class -- \*.query.md: A log showing the queries that where executed to complete each test in the order they were executed - -## Benchmark Platform CSV - -The benchmark-platform csv contains details for the system/VM where the test runner and Deephave Engine ran. -Details include Available Processors, JVM version, OS version, etc. 
- -Columns defined in the file are: -- origin: The service where the property was collected -- name: The property name -- value: The property value for the service - -Properties defined in the file are: -- java.version: The version of java running the application -- java.vm.name: The name of the java virtual machine running the application -- java.class.version: The class version the java VM supports -- os.name: The name of the operating system hosting the application -- os.version: The version of the operating system hosting the application -- available.processors: The number of CPUs the application is allowed to use -- java.max.memory: Maximum amount of memory the application is allowed to use -- python.version: The version of python used in the Deephaven Engine -- deephaven.version: The version of Deephaven tested against (client and server may be different) - - -### Example benchmark-platform.csv -```` -origin, name, value -test-runner, java.version, 17.0.1 -test-runner, java.vm.name, OpenJDK 64-Bit Server VM -test-runner, java.class.version, 61 -test-runner, os.name, Windows 10 -test-runner, os.version, 10 -test-runner, available.processors, 16 -test-runner, java.max.memory, 15.98G -test-runner, deephaven.version, 0.22.0 -deephaven-engine, java.version, 17.0.5 -deephaven-engine, java.vm.name, OpenJDK 64-Bit Server VM -deephaven-engine, java.class.version, 61 -deephaven-engine, os.name, Linux -deephaven-engine, os.version, 5.15.79.1-microsoft-standard-WSL2 -deephaven-engine, available.processors, 12 -deephaven-engine, java.max.memory, 42.00G -deephaven-engine, python.version, 3.10.6 -```` - -## Benchmark Results CSV - -The benchmark-results.csv contains measurements taken of the course of each test run. One row is listed for each test. - -Fields supplied in the file are: -- benchmark_name: The unique name of the benchmark -- origin: The serice where the benchmark was collected -- timestamp: Millis since epoch at the beginning of the benchmark -- test_duration: Seconds elapsed for the entire test run including setup and teardown -- op_duration: Seconds elapsed for the operation under measurement -- op_rate: Processing rate supplied by the test-writer -- row_count: The number of rows processed by the operation - -### Example benchmark-results.csv -```` -benchmark_name,origin,timestamp,test_duration,op_duration,op_rate,row_count -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926395799,155.4960,5.4450,367309458,2000000000 -Select- 1 Calc Using 2 Cols -Inc,deephaven-engine,1683926551491,12.1970,3.5900,111420612,400000000 -Select- 2 Cals Using 2 Cols -Static,n/a,1683926563714,8.7480,8.7480,1143118,10000000 -SelectDistinct- 1 Group 250 Unique Vals -Static,deephaven-engine,1683926572487,195.0420,18.5530,64679566,1200000000 -```` - -## Benchmark Metrics CSV - -The benchmark-metrics.csv contains metrics collected while running the benchmark. Most metrics (like MXBean metrics) represent a snapshot -at a moment in time. When these snapshots are taken and collected are up to the test-writer. For example, in the standard benchmarks -available in this project, metrics snaphosts are taken before and after the benchmark operation. The before-after metrics can be compared -to calculate things like Heap Gain or garbage collection counts. 
- -Field supplied in the file are: -- benchmark_name: The unique name of the benchmark -- origin: The serice where the metric was collected -- timestamp: Millis since epoch when the metrics was recorded -- category: A grouping category for the metric -- type: The type of metric has been collected (should be more narrowly focused than category) -- name: A metric name that is unique within the category -- value: The numeric value of the metric -- note: Any addition clarifying information - -## Example benchmark-metrics.csv -```` -benchmark_name,origin,timestamp,category,type,name,value,note -Select- 1 Calc Using 2 Cols -Static,test-runner,1683926402500,setup,docker,restart,6700,standard -Select- 1 Calc Using 2 Cols -Static,test-runner,1683926479998,source,generator,duration.secs,74.509, -Select- 1 Calc Using 2 Cols -Static,test-runner,1683926479998,source,generator,record.count,50000000, -Select- 1 Calc Using 2 Cols -Static,test-runner,1683926479998,source,generator,send.rate,671059.8719617764, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,ClassLoadingImpl,ClassLoading,TotalLoadedClassCount,13172, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,ClassLoadingImpl,ClassLoading,UnloadedClassCount,16, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,ObjectPendingFinalizationCount,0, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,HeapMemoryUsage Committed,9126805504, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,HeapMemoryUsage Init,1157627904, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,HeapMemoryUsage Max,25769803776, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,HeapMemoryUsage Used,2963455008, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,NonHeapMemoryUsage Committed,106430464, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,NonHeapMemoryUsage Init,7667712, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,NonHeapMemoryUsage Max,-1, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,MemoryImpl,Memory,NonHeapMemoryUsage Used,101707208, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,HotSpotThreadImpl,Threading,ThreadCount,50, -Select- 1 Calc Using 2 Cols -Static,deephaven-engine,1683926545385,HotSpotThreadImpl,Threading,PeakThreadCount,50, -```` - -## Query Log - -Query logs record queries in the order in which they were run during a test. These include queries run by the framework automatically behind the scenes. -Any property variables supplied in a query are replaced with the actual values used during the test run. After a test is run, it is usually possible to -copy and paste the query into the Deephaven UI and run it, because parquet and kafka topic data is left intact. However, if Test B is run after Test A, -running the recorded queries for Test A may not work, since cleanup is done automatically between tests. -The log is in Markdown format for easier viewing. 
- -### Example Query Log -~~~~ -# Test Class - io.deephaven.benchmark.tests.internal.examples.stream.JoinTablesFromKafkaStreamTest - -## Test - Count Records From Kakfa Stream - -### Query 1 -```` -from deephaven import kafka_consumer as kc -from deephaven.stream.kafka.consumer import TableType, KeyValueSpec - -def bench_api_kafka_consume(topic: str, table_type: str): - t_type = None - if table_type == 'append': t_type = TableType.append() - elif table_type == 'blink': t_type = TableType.blink() - elif table_type == 'ring': t_type = TableType.ring() - else: raise Exception('Unsupported kafka stream type: {}'.format(t_type)) - - return kc.consume( - { 'bootstrap.servers' : 'redpanda:29092', 'schema.registry.url' : 'http://redpanda:8081' }, - topic, partitions=None, offsets=kc.ALL_PARTITIONS_SEEK_TO_BEGINNING, - key_spec=KeyValueSpec.IGNORE, value_spec=kc.avro_spec(topic + '_record', schema_version='1'), - table_type=t_type) - -from deephaven.table import Table -from deephaven.ugp import exclusive_lock - -def bench_api_await_table_size(table: Table, row_count: int): - with exclusive_lock(): - while table.j_table.size() < row_count: - table.j_table.awaitUpdate() - -```` - -### Query 2 -```` -kafka_stock_trans = bench_api_kafka_consume('stock_trans', 'append') -bench_api_await_table_size(kafka_stock_trans, 100000) -```` - -### Query 3 -```` -kafka_stock_trans=None -from deephaven import garbage_collect; garbage_collect() -```` -~~~~ - -The above log is an example of what you would see in *test‑logs/io.deephaven.benchmark.tests.internal.examples.stream.JoinTablesFromKafkaStreamTest.query.md* -when running the *JoinTablesFromKafkaStreamTest.countRecordsFromKafkaStream()* integration test. - -The log has several components: -- Test Class: The fully-qualified class name of the test -- Test: The description of the test the test-writer supplied -- Query 1,2,3: The Deephaven queries executed in Deephaven in the order they were executed - -What the queries are doing: -- Query 1: The definitions of some the Bench API convenience functions used in Query 2. The test-writer used *bench_api_kafka_consume()* and -*bench_api_await_table_size()* in the query, and the Bench API automatically added the corresponding function definitions -- Query 2: The part that the test-writer wrote for the test (see *JoinTablesFromKafkaStreamTest.countRecordsFromKafkaStream()*) -- Query 3: The cleanup query added automatically by the Bench API - From f13a97cca416d279f34abee4c1e888bcf3081d25 Mon Sep 17 00:00:00 2001 From: stanbrub Date: Thu, 12 Dec 2024 12:57:07 -0700 Subject: [PATCH 05/12] Removed Secrets link since that is a part of another task --- docs/AdhocWorkflows.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/AdhocWorkflows.md b/docs/AdhocWorkflows.md index f67c949c..a83b05aa 100644 --- a/docs/AdhocWorkflows.md +++ b/docs/AdhocWorkflows.md @@ -7,9 +7,10 @@ A common practice is to run a comparison from a source branch that is ready for All results are stored according to the initiating user and a user-supplied label in the public [Benchmarking GCloud bucket](https://console.cloud.google.com/storage/browser/deephaven-benchmark). Though the results are available through public URLs, Google Cloud browsing is not. Retrieval of the generated data is mainly the domain of the Adhoc Dashboard. 
Prerequisites: -- Permission to use Deephaven's Bare Metal servers and [Github Secrets](./GithubSecrets.md) +- Permission to use Deephaven's Bare Metal servers and Github Secrets - An installation of a [Deephaven Community Core w/ Python docker image](https://deephaven.io/core/docs/getting-started/docker-install/) (0.36.1+) - The Adhoc Dashboard python snippet shown in this guide +- Access to the [Benchmark Workflow Actions](https://github.com/deephaven/benchmark/actions) ### Common Workflow UI Fields From 173c5211b5201ad9e63727bd84ae84f319fe698b Mon Sep 17 00:00:00 2001 From: Stan Brubaker <120737309+stanbrub@users.noreply.github.com> Date: Fri, 13 Dec 2024 16:14:43 -0700 Subject: [PATCH 06/12] Update docs/TestingConcepts.md Co-authored-by: rachelmbrubaker <84355699+rachelmbrubaker@users.noreply.github.com> --- docs/TestingConcepts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/TestingConcepts.md b/docs/TestingConcepts.md index cbb8d0e3..44d94885 100644 --- a/docs/TestingConcepts.md +++ b/docs/TestingConcepts.md @@ -1,6 +1,6 @@ # Benchmarking Concepts -Benchmark is designed to work easily in different contexts like within and IDE, from the command line, and as part of Github workflows. However, there is more to writing benchmarks than getting a test working and measuring between two points. Is it reproducible? Does it perform similarly at different scales? What about from day to day? Is setup included in the measurement? Is it easy to add to and maintain? +Benchmark is designed to work easily in different contexts like within an IDE, from the command line, and as part of Github workflows. However, there is more to writing benchmarks than getting a test working and measuring between two points. Is it reproducible? Does it perform similarly at different scales? What about from day to day? Is setup included in the measurement? Is it easy to add to and maintain? What follows are some concepts that guide the development of Benchmark meant to kept it versatile, simple, and relevant. From 0a32ef9d4a8804db3d025a10b4503bcbdecd3e3c Mon Sep 17 00:00:00 2001 From: Stan Brubaker <120737309+stanbrub@users.noreply.github.com> Date: Fri, 17 Jan 2025 16:56:13 -0700 Subject: [PATCH 07/12] Update AdhocWorkflows.md --- docs/AdhocWorkflows.md | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/docs/AdhocWorkflows.md b/docs/AdhocWorkflows.md index a83b05aa..b8c20753 100644 --- a/docs/AdhocWorkflows.md +++ b/docs/AdhocWorkflows.md @@ -12,11 +12,21 @@ Prerequisites: - The Adhoc Dashboard python snippet shown in this guide - Access to the [Benchmark Workflow Actions](https://github.com/deephaven/benchmark/actions) +### Starting a Benchmark Run + +The typical Benchmark run will be initiated from a Github on-demand workflow action in the main [Deephaven Benchmark](https://github.com/deephaven/benchmark/actions). However, the "Adhoc" workflows can be run from a fork as well, assuming the correct privileges are set up. (This is outside the scope of this document.) + +There are two Adhoc Workflows: +- [Adhoc Benchmarks (Auto-provisioned Server)](https://github.com/deephaven/benchmark/actions/workflows/adhoc-auto-remote-benchmarks.yml) +- [Adhoc Benchmarks (Existing Server) ](https://github.com/deephaven/benchmark/actions/workflows/adhoc-exist-remote-benchmarks.yml) + +Of the two workflows, privileged users will use the "Auto-provisioned Server" workflow in the vast majority of cases. 
Using the "Existing Server" workflow requires a dedicated server and extra setup. + ### Common Workflow UI Fields The ui fields used for both Adhoc workflows that are common are defined below: - Use workflow from - - Select the branch where the desired benchmarks are. This is typically "main" but could be a branch in a fork + - From the workflow dropdown, select the branch where the desired benchmarks are. This is typically "main" but could be a branch in a fork - Deephaven Image or Core Branch - The [Deephaven Core](https://github.com/deephaven/deephaven-core) branch, commit hash, tag, or docker image/sha - ex. Branch: `deephaven:main or myuser:mybranch` @@ -27,7 +37,7 @@ The ui fields used for both Adhoc workflows that are common are defined below: - Benchmark Test Classes - Wildcard names of available test classes. For example, `Avg*` will match the AvgByTest - Because of the nature of the benchmark runner, there is no way to select individual tests by name - - Test classes can be found under in the [standard test directory](https://github.com/deephaven/benchmark/tree/main/src/it/java/io/deephaven/benchmark/tests/standard) + - Test classes can be found under the [standard test directory](https://github.com/deephaven/benchmark/tree/main/src/it/java/io/deephaven/benchmark/tests/standard) - Benchmark Iterations - The number of iterations to run for each benchmark. Be careful, large numbers may take hours or days - Given that the Adhoc Dashboard uses medians, any even numbers entered here will be incremented From 3458be4028c46f0e503825daf636ab0391271d9e Mon Sep 17 00:00:00 2001 From: Stan Brubaker <120737309+stanbrub@users.noreply.github.com> Date: Fri, 17 Jan 2025 17:01:20 -0700 Subject: [PATCH 08/12] Update AdhocWorkflows.md --- docs/AdhocWorkflows.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/AdhocWorkflows.md b/docs/AdhocWorkflows.md index b8c20753..af260588 100644 --- a/docs/AdhocWorkflows.md +++ b/docs/AdhocWorkflows.md @@ -14,7 +14,7 @@ Prerequisites: ### Starting a Benchmark Run -The typical Benchmark run will be initiated from a Github on-demand workflow action in the main [Deephaven Benchmark](https://github.com/deephaven/benchmark/actions). However, the "Adhoc" workflows can be run from a fork as well, assuming the correct privileges are set up. (This is outside the scope of this document.) +The typical Benchmark run will be initiated from a [Github on-demand workflow](https://docs.github.com/en/actions/managing-workflow-runs-and-deployments/managing-workflow-runs/manually-running-a-workflow) action in the main [Deephaven Benchmark](https://github.com/deephaven/benchmark/actions). However, the "Adhoc" workflows can be run from a fork as well, assuming the correct privileges are set up. (This is outside the scope of this document.) 
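+
+For scripted or repeated comparisons, the same workflows can also be triggered through GitHub's `workflow_dispatch` REST endpoint instead of the web form, using the auto-provisioned workflow file named in the list below. The sketch is illustrative only: the input keys are assumptions (check the workflow YAML for the real ids), and the token must be allowed to run actions on the repository.
+```
+import json, os
+from urllib.request import Request, urlopen
+
+# Input keys below are assumptions - read the real ids from the workflow YAML
+payload = {'ref': 'main',
+           'inputs': {'docker_image': '0.37.0', 'test_classes': 'Avg*', 'set_label_prefix': 'myprefix'}}
+req = Request('https://api.github.com/repos/deephaven/benchmark/actions/workflows/adhoc-auto-remote-benchmarks.yml/dispatches',
+              data=json.dumps(payload).encode(),
+              headers={'Authorization': f'Bearer {os.environ["GITHUB_TOKEN"]}',
+                       'Accept': 'application/vnd.github+json'},
+              method='POST')
+urlopen(req)  # returns 204 No Content on success
+```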
There are two Adhoc Workflows: - [Adhoc Benchmarks (Auto-provisioned Server)](https://github.com/deephaven/benchmark/actions/workflows/adhoc-auto-remote-benchmarks.yml) From c409850d25b6a8ae490b7dc61f0fa9b31f2a867d Mon Sep 17 00:00:00 2001 From: Stan Brubaker <120737309+stanbrub@users.noreply.github.com> Date: Fri, 17 Jan 2025 17:10:57 -0700 Subject: [PATCH 09/12] Update TestingConcepts.md --- docs/TestingConcepts.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/TestingConcepts.md b/docs/TestingConcepts.md index 44d94885..ef10ae9e 100644 --- a/docs/TestingConcepts.md +++ b/docs/TestingConcepts.md @@ -8,8 +8,7 @@ What follows are some concepts that guide the development of Benchmark meant to The *Bench* API uses the builder pattern to guide the test writer in generating data, executing queries, and fetching results. There is a single API entry point where a user can follow the dots and look at the code-insight and Javadocs that pop up in the IDE. Default properties can be overriden by builder-style "with" methods like *withRowCount()*. A middle ground is taken between text configuration and configuration fully-expressed in code to keep things simple and readable. ### Scale Rather Than Iterations -Repeating tests can be useful for testing the effects of caching (e.g. load file multiple times; is it faster on subsequent loads?), or overcoming a lack of precision in OS timers (e.g. run a fast function many times and average), or average out variability between runs (there are always anomalies). On the other hand, if the context of the test is processing large data sets, then it's better to measure against large data sets where possible. This provides a benchmark test that's closer to the real thing when it comes to memory consumption, garbage collection, thread usage, and JIT optimizations. Repeating tests, though useful in some scenarios, can have the effect of taking the operation under test out of the benchmark equation because of cached results, resets for each iteration, -limited heap usage, or smaller data sets that are too uniform. +Repeating tests can be useful for testing the effects of caching (e.g. load file multiple times; is it faster on subsequent loads?), or overcoming a lack of precision in OS timers (e.g. run a fast function many times and average), or average out variability between runs (there are always anomalies). On the other hand, if the context of the test is processing large data sets, then it's better to measure against large data sets where possible. This provides a benchmark test that's closer to the real thing when it comes to memory consumption, garbage collection, thread usage, and JIT optimizations. Repeating tests, though useful in some scenarios, can have the effect of taking the operation under test out of the benchmark equation because of cached results, resets for each iteration, limited heap usage, or smaller data sets that are too uniform. ### Adjust Scale For Each Test When measuring a full set of benchmarks for transforming data, some benchmarks will naturally be faster than others (e.g. sums vs joins). Running all benchmarks at the same scale (e.g. 10 million rows) could yield results where one benchmark takes a minute and another takes 100 milliseconds. Is the 100 ms test meaningful, especially when measured in a JVM? Not really, because there is no time to assess the impact of JVM ergonomics or the effect of OS background tasks. A better way is to set scale multipliers to amplify row count for tests that need it. 
@@ -21,8 +20,7 @@ Want to know what tables and operations the test uses? Go to the test. Want to k Tests are developed by test-writers, so why not make it easy for them? Run tests from the IDE for ease of debugging. Point the tests to a local or a remote Deephaven Server instance. Or package tests in a jar and run them locally or remotely from the Benchmark uber-jar. The same tests should work whether running everything on the same system or different system. ### Measure Where It Matters -The Benchmark framework allows the test-writer to set each benchmark measurement from the test code instead of relying on a mechanism that measures automatically behind the scenes. Measurements can be taken across the execution of the test locally with a *Timer* like in the -[JoinTablesFromKafkaStreamTest](src/it/java/io/deephaven/benchmark/tests/internal/examples/stream/JoinTablesFromKafkaStreamTest.java) example test or fetched from the remote Deephaven instance where the test is running as is done in the [StandardTestRunner](src/it/java/io/deephaven/benchmark/tests/standard/StandardTestRunner.java) used for nightly Deephaven benchmarks. Either way the submission of the result to the Benchmark framework is under the test-writer's control. +The Benchmark framework allows the test-writer to set each benchmark measurement from the test code instead of relying on a mechanism that measures automatically behind the scenes. Measurements can be taken across the execution of the test locally with a *Timer* like in the [JoinTablesFromKafkaStreamTest](src/it/java/io/deephaven/benchmark/tests/internal/examples/stream/JoinTablesFromKafkaStreamTest.java) example test or fetched from the remote Deephaven instance where the test is running as is done in the [StandardTestRunner](src/it/java/io/deephaven/benchmark/tests/standard/StandardTestRunner.java) used for nightly Deephaven benchmarks. Either way the submission of the result to the Benchmark framework is under the test-writer's control. ### Preserve and Compare Most benchmarking efforts involve a fixed timeline for "improving performance" rather than tracking the performance impacts of code changes day-to-day. This can be effective, but how do you know if future code changes are degrading performance unless benchmarking is done every day. A better way is to preserve benchmarking results every day and compare to the results from the previous week. To avoid death by a thousand cuts, compare release to release as well for the same benchmarks. Also, compare benchmarks for your product with other products for equivalent operations. From 0131e1d3f3edfaef1dd744ea5376a211326fd236 Mon Sep 17 00:00:00 2001 From: Stan Brubaker <120737309+stanbrub@users.noreply.github.com> Date: Fri, 17 Jan 2025 17:13:56 -0700 Subject: [PATCH 10/12] Update TestingConcepts.md --- docs/TestingConcepts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/TestingConcepts.md b/docs/TestingConcepts.md index ef10ae9e..fec8b262 100644 --- a/docs/TestingConcepts.md +++ b/docs/TestingConcepts.md @@ -20,7 +20,7 @@ Want to know what tables and operations the test uses? Go to the test. Want to k Tests are developed by test-writers, so why not make it easy for them? Run tests from the IDE for ease of debugging. Point the tests to a local or a remote Deephaven Server instance. Or package tests in a jar and run them locally or remotely from the Benchmark uber-jar. The same tests should work whether running everything on the same system or different system. 
 ### Measure Where It Matters
-The Benchmark framework allows the test-writer to set each benchmark measurement from the test code instead of relying on a mechanism that measures automatically behind the scenes. Measurements can be taken across the execution of the test locally with a *Timer* like in the [JoinTablesFromKafkaStreamTest](src/it/java/io/deephaven/benchmark/tests/internal/examples/stream/JoinTablesFromKafkaStreamTest.java) example test or fetched from the remote Deephaven instance where the test is running as is done in the [StandardTestRunner](src/it/java/io/deephaven/benchmark/tests/standard/StandardTestRunner.java) used for nightly Deephaven benchmarks. Either way the submission of the result to the Benchmark framework is under the test-writer's control.
+The Benchmark framework allows the test-writer to set each benchmark measurement from the test code instead of relying on a mechanism that measures automatically behind the scenes. Measurements can be taken across the execution of the test locally with a *Timer* like in the [JoinTablesFromKafkaStreamTest](../src/it/java/io/deephaven/benchmark/tests/internal/examples/stream/JoinTablesFromKafkaStreamTest.java) example test or fetched from the remote Deephaven instance where the test is running as is done in the [StandardTestRunner](../src/it/java/io/deephaven/benchmark/tests/standard/StandardTestRunner.java) used for nightly Deephaven benchmarks. Either way the submission of the result to the Benchmark framework is under the test-writer's control.

 ### Preserve and Compare
 Most benchmarking efforts involve a fixed timeline for "improving performance" rather than tracking the performance impacts of code changes day-to-day. This can be effective, but how do you know if future code changes are degrading performance unless benchmarking is done every day. A better way is to preserve benchmarking results every day and compare to the results from the previous week. To avoid death by a thousand cuts, compare release to release as well for the same benchmarks. Also, compare benchmarks for your product with other products for equivalent operations.

From 98d8c1f2d0efb98473487696f946884e93e69022 Mon Sep 17 00:00:00 2001
From: Stan Brubaker <120737309+stanbrub@users.noreply.github.com>
Date: Fri, 17 Jan 2025 17:14:42 -0700
Subject: [PATCH 11/12] Update TestingConcepts.md

---
 docs/TestingConcepts.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/TestingConcepts.md b/docs/TestingConcepts.md
index fec8b262..38f9a283 100644
--- a/docs/TestingConcepts.md
+++ b/docs/TestingConcepts.md
@@ -20,7 +20,7 @@ Want to know what tables and operations the test uses? Go to the test. Want to k
 Tests are developed by test-writers, so why not make it easy for them? Run tests from the IDE for ease of debugging. Point the tests to a local or a remote Deephaven Server instance. Or package tests in a jar and run them locally or remotely from the Benchmark uber-jar. The same tests should work whether running everything on the same system or different system.

 ### Measure Where It Matters
-The Benchmark framework allows the test-writer to set each benchmark measurement from the test code instead of relying on a mechanism that measures automatically behind the scenes. Measurements can be taken across the execution of the test locally with a *Timer* like in the [JoinTablesFromKafkaStreamTest](../src/it/java/io/deephaven/benchmark/tests/internal/examples/stream/JoinTablesFromKafkaStreamTest.java) example test or fetched from the remote Deephaven instance where the test is running as is done in the [StandardTestRunner](../src/it/java/io/deephaven/benchmark/tests/standard/StandardTestRunner.java) used for nightly Deephaven benchmarks. Either way the submission of the result to the Benchmark framework is under the test-writer's control.
+The Benchmark framework allows the test-writer to set each benchmark measurement from the test code instead of relying on a mechanism that measures automatically behind the scenes. Measurements can be taken across the execution of the test locally with a *Timer* like in the [JoinTablesFromKafkaStreamTest](/src/it/java/io/deephaven/benchmark/tests/internal/examples/stream/JoinTablesFromKafkaStreamTest.java) example test or fetched from the remote Deephaven instance where the test is running as is done in the [StandardTestRunner](/src/it/java/io/deephaven/benchmark/tests/standard/StandardTestRunner.java) used for nightly Deephaven benchmarks. Either way the submission of the result to the Benchmark framework is under the test-writer's control.

 ### Preserve and Compare
 Most benchmarking efforts involve a fixed timeline for "improving performance" rather than tracking the performance impacts of code changes day-to-day. This can be effective, but how do you know if future code changes are degrading performance unless benchmarking is done every day. A better way is to preserve benchmarking results every day and compare to the results from the previous week. To avoid death by a thousand cuts, compare release to release as well for the same benchmarks. Also, compare benchmarks for your product with other products for equivalent operations.

From b3ab685864e4d79c37a9446abc1a037f91cb5b55 Mon Sep 17 00:00:00 2001
From: Stan Brubaker <120737309+stanbrub@users.noreply.github.com>
Date: Fri, 17 Jan 2025 17:29:38 -0700
Subject: [PATCH 12/12] Update TestingConcepts.md

---
 docs/TestingConcepts.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/TestingConcepts.md b/docs/TestingConcepts.md
index 38f9a283..7605ba95 100644
--- a/docs/TestingConcepts.md
+++ b/docs/TestingConcepts.md
@@ -11,7 +11,7 @@ The *Bench* API uses the builder pattern to guide the test writer in generating
 Repeating tests can be useful for testing the effects of caching (e.g. load file multiple times; is it faster on subsequent loads?), or overcoming a lack of precision in OS timers (e.g. run a fast function many times and average), or average out variability between runs (there are always anomalies). On the other hand, if the context of the test is processing large data sets, then it's better to measure against large data sets where possible. This provides a benchmark test that's closer to the real thing when it comes to memory consumption, garbage collection, thread usage, and JIT optimizations. Repeating tests, though useful in some scenarios, can have the effect of taking the operation under test out of the benchmark equation because of cached results, resets for each iteration, limited heap usage, or smaller data sets that are too uniform.

 ### Adjust Scale For Each Test
-When measuring a full set of benchmarks for transforming data, some benchmarks will naturally be faster than others (e.g. sums vs joins). Running all benchmarks at the same scale (e.g. 10 million rows) could yield results where one benchmark takes a minute and another takes 100 milliseconds. Is the 100 ms test meaningful, especially when measured in a JVM? Not really, because there is no time to assess the impact of JVM ergonomics or the effect of OS background tasks. A better way is to set scale multipliers to amplify row count for tests that need it.
+When measuring a full set of benchmarks for transforming data, some benchmarks will naturally be faster than others (e.g. sums vs joins). Running all benchmarks at the same scale (e.g. 10 million rows) could yield results where one benchmark takes a minute and another takes 100 milliseconds. Is the 100 ms test meaningful, especially when measured in a JVM? Not really, because there is no time to assess the impact of JVM ergonomics or the effect of OS background tasks. A better way is to set scale multipliers to amplify row count for tests that need it and aim for a meaningful test duration.

 ### Test-centric Design
 Want to know what tables and operations the test uses? Go to the test. Want to know what the framework is doing behind the scenes? Step through the test. Want to run one or more tests? Start from the test rather than configuring an external tool and deploying to that. Let the framework handle the hard part. The point is that a benchmark test against a remote server should be as easy and clear to write as a unit test. As far as is possible, data generation should be defined in the same place it's used... in the test.
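As a closing illustration of the "Adjust Scale For Each Test" guidance in the final patch above, the sketch below amplifies a shared base row count with per-operation multipliers so that fast operations (e.g. sums) still run long enough to yield a meaningful duration. The multiplier values and lookup are assumptions made for illustration; they are not the framework's actual configuration mechanism.

```java
import java.util.Map;

// Sketch of per-test scale multipliers (illustrative values, not framework config).
public class ScaleMultiplierSketch {

    static final long BASE_ROW_COUNT = 10_000_000L;

    // Fast operations get a larger multiplier than inherently slower ones.
    static final Map<String, Double> MULTIPLIERS = Map.of(
            "sum", 8.0,
            "avg", 8.0,
            "join", 1.0);

    static long rowCountFor(String operation) {
        return Math.round(BASE_ROW_COUNT * MULTIPLIERS.getOrDefault(operation, 1.0));
    }

    public static void main(String[] args) {
        for (String op : new String[] {"sum", "avg", "join"}) {
            System.out.println(op + " -> " + rowCountFor(op) + " rows");
        }
    }
}
```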