
Backend ‐ Metrics


What is it?

This system captures plugin activity data, which includes Install Activity metrics and GitHub Activity metrics.

Architecture

[Architecture diagram]

The architecture consists of three major parts:

  1. The external datastore that holds the complete dataset.
  2. The data fetch workflow that keeps the data fresh.
  3. The API that surfaces the data to the client.

The external datastore

This is a Snowflake system managed by the data science team. The data science team refreshes the data by appending new rows to the table; this update happens every day at 5:00 AM.

Install Activity

The data is fetched from PyPI's download statistics. A view is created on top of the table to filter out installs performed by CI/CD pipelines, which gives a closer estimate of individual users installing the plugin for their own work.

GitHub Activity

The data, which represents all commit activity on the plugins' GitHub repositories, is fetched from GitHub. It is stored in the imaging.github.commits table in Snowflake, which contains columns such as repo, author ID, commit timestamp, commit message, repo URL, and ingestion timestamp.

Data fetch

The data fetch workflow lives in the data-workflows codebase. It queries Snowflake, transforms the data, and writes the results to the relevant DynamoDB tables, to which it has write access.

CloudWatch event rule

An EventBridge (CloudWatch Events) rule schedules the workflow as a cron job. The rule runs at 13:00 UTC daily, after the data science team's workflow has updated the tables. The rule publishes the following JSON message to the SQS queue:

{"type": "activity"}

SQS Message

The message acts as the trigger for the Lambda. Routing the trigger through SQS allows the message to be reprocessed if the Lambda fails.
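A sketch of the Lambda entry point consuming this trigger; the standard SQS event shape is assumed, and run_activity_workflow is a hypothetical stand-in for the fetch/transform/write flow described below:

```python
# Sketch of the Lambda entry point consuming the SQS trigger.
import json


def run_activity_workflow():
    ...  # hypothetical: query Snowflake, transform, batch write to DynamoDB


def handler(event, context):
    for record in event["Records"]:  # SQS delivers a batch of records
        body = json.loads(record["body"])
        if body.get("type") == "activity":
            run_activity_workflow()
    # Raising an exception instead of returning leaves the message on the
    # queue, so it is redelivered and reprocessed after the visibility timeout.
```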

Parameter Store

It stores the state that must be passed between successive runs of the Lambda for activity workflow processing. The activity process records the timestamp up to which all activity ingested into Snowflake has been processed successfully, stored as the value of last_activity_fetched_timestamp.
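A sketch of how the checkpoint could be read and written with boto3; the parameter path follows the Backfill section below (leading slash assumed), and treating its value as JSON with a last_activity_fetched_timestamp key is an assumption about the storage format:

```python
# Sketch of reading and updating the checkpoint in Parameter Store.
import json

import boto3

ssm = boto3.client("ssm")
PARAM = "/napari-hub/data-workflows/config"  # path from the Backfill section


def get_last_fetched_timestamp() -> int:
    value = ssm.get_parameter(Name=PARAM)["Parameter"]["Value"]
    # Assumes the parameter value is a JSON object holding the checkpoint.
    return int(json.loads(value)["last_activity_fetched_timestamp"])


def set_last_fetched_timestamp(ts: int) -> None:
    ssm.put_parameter(
        Name=PARAM,
        Value=json.dumps({"last_activity_fetched_timestamp": ts}),
        Type="String",
        Overwrite=True,
    )
```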

Dynamo

Install Activity

The data is stored in the install-activity DynamoDB table, and here is the schema.

GitHub Activity

The data is stored in the github-activity DynamoDB table, and here is the schema.

Lambda

Fetching start timestamp of query window:

  • The timestamp of the last run is fetched from Parameter Store and used as the start_time of the query window; the current time is used as the end_time.

Processing for Install Activity:

  • The Snowflake view (imaging.pypi.labeled_downloads) is queried for the plugins whose earliest install activity was added between the last run (start_time) and now (end_time).
  • Starting from each plugin's earliest install activity, data is computed at day, month, and total granularity for all plugins returned by the previous query.
  • The fetched records are transformed into DynamoDB records of the relevant types.
  • The records are batch written to the install-activity DynamoDB table (see the sketch after this list).
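A sketch of the batch write; the table and attribute names (name, type_timestamp, installs) follow the API section below, and the input record shape is illustrative:

```python
# Sketch of the batch write into the install-activity table.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("install-activity")  # table name from this page


def write_install_activity(records: list[dict]) -> None:
    # batch_writer handles chunking into 25-item batches and retries.
    with table.batch_writer() as batch:
        for record in records:
            batch.put_item(Item={
                "name": record["plugin"],                    # partition key
                "type_timestamp": record["type_timestamp"],  # e.g. "MONTH:202305"
                "installs": record["installs"],
            })
```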

Processing for GitHub Activity:

  • The Snowflake table (imaging.github.commits) is queried for the plugins whose earliest GitHub activity was added between the last run (start_time) and now (end_time).
  • Starting from each plugin's earliest commit activity, data is computed at latest, month, and total granularity for all plugins returned by the previous query.
  • The fetched records are transformed into DynamoDB records of the relevant types.
  • The records are batch written to the github-activity DynamoDB table, mirroring the install-activity write above.

Storing end timestamp of query window:

  • On successful completion of the workflow, Parameter Store is updated with the end_time used in the workflow.

Install Activity API

total_installs:

GetItem for records from the install-activity table with key condition name=:plugin_name AND type_timestamp='TOTAL:' and projection installs.
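A minimal boto3 sketch of this lookup, using the table and attribute names from this page (error handling omitted):

```python
# Sketch of the total_installs lookup.
import boto3

table = boto3.resource("dynamodb").Table("install-activity")


def total_installs(plugin_name: str) -> int:
    response = table.get_item(
        Key={"name": plugin_name, "type_timestamp": "TOTAL:"},
        ProjectionExpression="installs",
    )
    item = response.get("Item")
    return int(item["installs"]) if item else 0
```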

installs_in_last_30_days:

Query for records from the install-activity table with key condition expression name=:plugin_name AND type_timestamp BETWEEN :start_date AND :end_date and projection installs. The start_date and end_date are computed dynamically to reflect the last 30 days.

timeline:

Query for records from the install-activity table with key condition expression name=:plugin_name AND type_timestamp BETWEEN :start_month AND :end_month and projections installs and timestamp. The start_month and end_month are computed dynamically to reflect the number of months over which the timeline data is needed.
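installs_in_last_30_days and timeline share the same Query pattern; here is a sketch of the 30-day variant, where the DAY:YYYYMMDD sort-key format is an assumption about how type_timestamp encodes day-level granularity:

```python
# Sketch of the windowed Query; the DAY:YYYYMMDD key format is an assumption.
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("install-activity")


def installs_in_last_30_days(plugin_name: str) -> int:
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=30)
    response = table.query(
        KeyConditionExpression=Key("name").eq(plugin_name)
        & Key("type_timestamp").between(f"DAY:{start:%Y%m%d}", f"DAY:{end:%Y%m%d}"),
        ProjectionExpression="installs",
    )
    return sum(int(item["installs"]) for item in response["Items"])
```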

GitHub Activity API

total_commits:

GetItem for records from the github-activity table with key condition name=:plugin_name AND type_identifier=f"TOTAL:{repo_name}" and projection commits.

latest_commit:

Query for records from the github-activity table with key condition expression name=:plugin_name AND type_identifier=f"LATEST:{repo_name}" and projection commits.

timeline:

Query for records from the github-activity table with key condition expression name=:plugin_name AND type_identifier BETWEEN f"MONTH:{start_month}:{repo_name}" AND f"MONTH:{end_month}:{repo_name}" and projections commits and timestamp. The start_month and end_month are computed dynamically to reflect the number of months over which the timeline data is needed.
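A sketch of this timeline query; note that timestamp is a DynamoDB reserved word, so it must be aliased in the projection, and the YYYYMM month format is an assumption:

```python
# Sketch of the GitHub timeline query; month format YYYYMM is assumed.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("github-activity")


def commit_timeline(plugin_name: str, repo_name: str,
                    start_month: str, end_month: str) -> list[dict]:
    # start_month / end_month in "YYYYMM" form, computed by the caller.
    response = table.query(
        KeyConditionExpression=Key("name").eq(plugin_name)
        & Key("type_identifier").between(
            f"MONTH:{start_month}:{repo_name}",
            f"MONTH:{end_month}:{repo_name}",
        ),
        # "timestamp" is a reserved word, so alias it via #ts.
        ProjectionExpression="commits, #ts",
        ExpressionAttributeNames={"#ts": "timestamp"},
    )
    return response["Items"]
```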

Backfill

Step 1: Log in to the AWS console, search for Parameter Store, and open it.

Step 2: Search for napari-hub/data-workflows/config.

Step 3: To backfill all of the data, set the last_activity_fetched_timestamp variable to 0 in the staging and prod environments to kick off the workflow.
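Equivalently, the reset can be scripted rather than done through the console; a sketch assuming the parameter holds JSON as in the Parameter Store sketch above (the leading slash in the path is an assumption):

```python
# Sketch: trigger a full backfill by resetting the checkpoint to 0.
import json

import boto3

ssm = boto3.client("ssm")

ssm.put_parameter(
    Name="/napari-hub/data-workflows/config",  # path from Step 2
    Value=json.dumps({"last_activity_fetched_timestamp": 0}),
    Type="String",
    Overwrite=True,
)
```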

Troubleshooting

If any issues occur while backfilling the data, look through the Lambda logs to pinpoint the problem.