Backend ‐ Metrics
This captures the plugins' activity data, which includes Install Activity metrics and GitHub Activity metrics.
The architecture consists of three major parts:
- The external datastore that holds the complete dataset.
- The data fetch workflow that keeps the data fresh.
- The API that surfaces the data to the client.
The external datastore is a Snowflake system managed by the data science team. The team refreshes the data by appending new rows to the table; this update happens every day at 5:00 AM.
The install data is fetched from PyPI's download statistics. A view is created on top of the table to filter out all the installs done by CI/CD pipelines, which gives a closer estimate of individual users installing the plugin for their own processing.
The GitHub data, which represents all commit activity on the plugins' GitHub repos, is fetched from GitHub's statistics.
The GitHub data is stored in the `imaging.github.commits` table in Snowflake, which contains columns such as repo, author id, commit timestamp, commit message, repo URL, and ingestion timestamp.
The data fetch workflow lives in the `data-workflows` codebase. It queries Snowflake, transforms the data, and writes it to the relevant DynamoDB tables, to which it has write access.
A CloudWatch EventBridge rule schedules the workflow as a cron job. The rule runs at 13:00 UTC daily, after the data science team's workflow has updated the tables, and publishes the following JSON message to the SQS queue:
{"type": "activity"}
The message acts as a trigger for the Lambda. Routing it through SQS allows the message to be reprocessed in case of failures in the Lambda.
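As an illustration, here is a minimal sketch of an SQS-triggered Lambda handler for this message; the handler and helper names are hypothetical, not the actual implementation:

```python
import json


def run_activity_workflow():
    # Hypothetical stand-in for the actual data fetch workflow described
    # below (query Snowflake, transform, write to DynamoDB, update the
    # Parameter Store).
    pass


def handle(event, context):
    # SQS-triggered Lambdas receive a batch of records; each body carries
    # the JSON message published by the EventBridge rule.
    for record in event.get("Records", []):
        message = json.loads(record["body"])
        if message.get("type") == "activity":
            run_activity_workflow()
    # If this handler raises, the message is not deleted from the queue and
    # SQS redelivers it, which is what allows failed runs to be reprocessed.
```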
The Parameter Store holds the information that needs to be passed between subsequent runs of the Lambda for the activity workflow. The activity process stores the timestamp up to which all activity ingested into Snowflake has been processed successfully, as the value of `last_activity_fetched_timestamp`.
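A minimal sketch of reading and advancing this checkpoint with boto3's SSM client; the parameter name is taken from the backfill steps at the end of this page, while the JSON layout and the use of epoch milliseconds are assumptions:

```python
import json

import boto3

ssm = boto3.client("ssm")

# Assumed parameter name, based on the backfill instructions below.
PARAMETER_NAME = "/napari-hub/data-workflows/config"


def get_last_activity_fetched_timestamp() -> int:
    response = ssm.get_parameter(Name=PARAMETER_NAME)
    config = json.loads(response["Parameter"]["Value"])
    return int(config.get("last_activity_fetched_timestamp", 0))


def set_last_activity_fetched_timestamp(timestamp_ms: int) -> None:
    ssm.put_parameter(
        Name=PARAMETER_NAME,
        Value=json.dumps({"last_activity_fetched_timestamp": timestamp_ms}),
        Type="String",
        Overwrite=True,
    )
```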
The install activity data is stored in the `install-activity` DynamoDB table (see the schema sketch below).
The GitHub activity data is stored in the `github-activity` DynamoDB table (see the schema sketch below).
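A minimal sketch of the two tables' key structure, inferred from the query patterns described later on this page; the attribute names come from those queries, while key roles, attribute types, and billing mode are assumptions:

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Partition key `name` (plugin name) with a composite sort key per table.
dynamodb.create_table(
    TableName="install-activity",
    KeySchema=[
        {"AttributeName": "name", "KeyType": "HASH"},
        {"AttributeName": "type_timestamp", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "name", "AttributeType": "S"},
        {"AttributeName": "type_timestamp", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)

dynamodb.create_table(
    TableName="github-activity",
    KeySchema=[
        {"AttributeName": "name", "KeyType": "HASH"},
        {"AttributeName": "type_identifier", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "name", "AttributeType": "S"},
        {"AttributeName": "type_identifier", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```

Non-key attributes such as `installs`, `commits`, and `timestamp` do not need to be declared up front in DynamoDB, which is why they do not appear in the key definitions.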
- The timestamp of the last run is fetched from the Parameter Store and used as the start_time for the window, and the current time is used as the end_time.
- The Snowflake view (`imaging.pypi.labeled_downloads`) is then queried to get the names of plugins with install activity added between the last run (start_time) and now (end_time).
- For each plugin returned by the previous query, the data is computed at day, month, and total granularity, starting from that plugin's earliest install activity.
- The records fetched are transformed into DynamoDB records of the relevant types.
- The records are batch-written to the `install-activity` DynamoDB table.
- The Snowflake table (`imaging.github.commits`) is then queried to get the names of plugins with GitHub activity added between the last run (start_time) and now (end_time).
- For each plugin returned by the previous query, the data is computed at latest, month, and total granularity, starting from that plugin's earliest commit activity.
- The records fetched are transformed into DynamoDB records of the relevant types.
- The records are batch-written to the `github-activity` DynamoDB table.
- On successful completion of the workflow, the Parameter Store is updated with the end_time used in the workflow (a sketch of the overall flow follows this list).
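The same sequence, reduced to a minimal Python sketch; the Snowflake query, helper names, and record formats are hypothetical, and the `github-activity` half follows the same pattern:

```python
import time

import boto3

dynamodb = boto3.resource("dynamodb")


def fetch_new_install_activity(start_time_ms, end_time_ms):
    # Hypothetical stand-in for querying imaging.pypi.labeled_downloads in
    # Snowflake and aggregating it at DAY / MONTH / TOTAL granularity.
    return []


def run_install_activity_step(start_time_ms, end_time_ms):
    # Transform the aggregated rows into DynamoDB items and batch-write them.
    table = dynamodb.Table("install-activity")
    with table.batch_writer() as batch:
        for item in fetch_new_install_activity(start_time_ms, end_time_ms):
            batch.put_item(Item=item)


if __name__ == "__main__":
    # start_time comes from last_activity_fetched_timestamp (see above);
    # 0 is used here only as a placeholder. end_time is "now".
    end_time_ms = int(time.time() * 1000)
    run_install_activity_step(start_time_ms=0, end_time_ms=end_time_ms)
    # Only after both the install and GitHub steps succeed is the Parameter
    # Store updated with end_time_ms, so a failed run is re-covered by the
    # next window.
```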
GetItem for records from the `install-activity` table with key condition `name=:plugin_name AND type_timestamp='TOTAL:'` and projection `installs`.
Query for records from the `install-activity` table with key condition expression `name=:plugin_name AND type_timestamp BETWEEN :start_date AND :end_date` and projection `installs`. The start_date and end_date are computed dynamically to reflect the last 30 days.
Query for records from the `install-activity` table with key condition expression `name=:plugin_name AND type_timestamp BETWEEN :start_month AND :end_month` and projections `installs` and `timestamp`. The start_month and end_month are computed dynamically to reflect the number of months over which the timeline data is needed.
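A minimal boto3 sketch of these three access patterns; the plugin name and the exact sort-key value formats (the `DAY:`/`MONTH:` prefixes and date encodings) are assumptions for illustration:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("install-activity")

plugin_name = "example-plugin"  # hypothetical plugin name

# Total installs: GetItem on the TOTAL: record, projecting only `installs`.
total = table.get_item(
    Key={"name": plugin_name, "type_timestamp": "TOTAL:"},
    ProjectionExpression="installs",
)

# Recent installs: Query over an assumed sort-key range covering the last 30 days.
recent = table.query(
    KeyConditionExpression=Key("name").eq(plugin_name)
    & Key("type_timestamp").between("DAY:20240101", "DAY:20240131"),
    ProjectionExpression="installs",
)

# Monthly timeline: same pattern over MONTH records, projecting installs and
# timestamp. `timestamp` is a DynamoDB reserved word, so an expression
# attribute name is used for it.
timeline = table.query(
    KeyConditionExpression=Key("name").eq(plugin_name)
    & Key("type_timestamp").between("MONTH:202401", "MONTH:202406"),
    ProjectionExpression="installs, #ts",
    ExpressionAttributeNames={"#ts": "timestamp"},
)
```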
GetItem for records from the `github-activity` table with key condition `name=:plugin_name AND type_identifier=f"TOTAL:{repo_name}"` and projection `commits`.
Query for records from the `github-activity` table with key condition expression `name=:plugin_name AND type_identifier=f"LATEST:{repo_name}"` and projection `commits`. This returns the record reflecting the latest commit.
Query for records from the `github-activity` table with key condition expression `name=:plugin_name AND type_identifier BETWEEN f"MONTH:{start_month}:{repo_name}" AND f"MONTH:{end_month}:{repo_name}"` and projections `commits` and `timestamp`. The start_month and end_month are computed dynamically to reflect the number of months over which the timeline data is needed.
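A minimal boto3 sketch of the corresponding `github-activity` access patterns; the plugin name, repo name, and the exact `MONTH:` key encoding are assumptions:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("github-activity")

plugin_name = "example-plugin"            # hypothetical values
repo_name = "example-org/example-repo"    # for illustration only

# Total commits for the repo.
total = table.get_item(
    Key={"name": plugin_name, "type_identifier": f"TOTAL:{repo_name}"},
    ProjectionExpression="commits",
)

# Record reflecting the latest commit for the repo.
latest = table.query(
    KeyConditionExpression=Key("name").eq(plugin_name)
    & Key("type_identifier").eq(f"LATEST:{repo_name}"),
    ProjectionExpression="commits",
)

# Monthly commit timeline over an assumed MONTH:<yyyymm>:<repo> key format.
timeline = table.query(
    KeyConditionExpression=Key("name").eq(plugin_name)
    & Key("type_identifier").between(
        f"MONTH:202401:{repo_name}", f"MONTH:202406:{repo_name}"
    ),
    ProjectionExpression="commits, #ts",
    ExpressionAttributeNames={"#ts": "timestamp"},
)
```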
Step 1: Open the AWS console, search for Parameter Store, and click on it.
Step 2: Search for `napari-hub/data-workflows/config`.
Step 3: To backfill the entire dataset, set the `last_activity_fetched_timestamp` variable to 0 in the staging and prod environments to kickstart the workflow.
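The same reset can be scripted; here is a minimal boto3 sketch, reusing the assumed parameter name and JSON layout from the checkpoint sketch earlier on this page:

```python
import json

import boto3

ssm = boto3.client("ssm")

# Setting the checkpoint to 0 makes the next workflow run reprocess all
# activity from the beginning (a full backfill).
ssm.put_parameter(
    Name="/napari-hub/data-workflows/config",
    Value=json.dumps({"last_activity_fetched_timestamp": 0}),
    Type="String",
    Overwrite=True,
)
```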
If any issue occurs while backfilling the data, look through the Lambda logs to pinpoint the problem.