core

Core Tools for CHAI Python Loaders

This directory contains a set of core tools and utilities for loading the CHAI database with package manager data, using Python helpers. These tools provide a common foundation for fetching, transforming, and loading data from various package managers into the database.

Key Components

Config always runs first and is the entrypoint for all loaders. It includes:

  • Execution flags:
    • FETCH determines whether we request the data from the source
    • TEST enables a test mode, used to exercise specific portions of the pipeline
    • NO_CACHE determines whether we save the intermediate pipeline files
  • Package manager flags:
    • pm_id is the ID, looked up in the database, of the package manager the pipeline runs for
    • source is the data source for that package manager. SOURCES defines the map.
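The execution flags above could be read along these lines. This is only a minimal sketch: it assumes the flags arrive as environment variables, and `env_flag`, `ExecConfig`, `load_exec_config`, the default values, and the placeholder URL in `SOURCES` are hypothetical names and values, not the actual Config implementation.

```python
import os
from dataclasses import dataclass

# Hypothetical source map: package manager name -> data source URL.
# The URL is a placeholder, not a real endpoint.
SOURCES = {
    "crates": "https://example.invalid/crates/db-dump.tar.gz",
}


def env_flag(name, default=False, env=None):
    """Interpret an environment variable as a boolean flag."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ExecConfig:
    fetch: bool     # request fresh data from the source?
    test: bool      # run in test mode?
    no_cache: bool  # skip keeping intermediate pipeline files?


def load_exec_config(env=None):
    """Build the execution flags from the environment (assumed mechanism)."""
    return ExecConfig(
        fetch=env_flag("FETCH", env=env),
        test=env_flag("TEST", env=env),
        no_cache=env_flag("NO_CACHE", env=env),
    )
```

Passing a plain dictionary as `env` keeps the sketch easy to exercise without touching the real process environment.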

The next three configuration classes retrieve the IDs for URL types (homepage, documentation, etc.), dependency types (build, runtime, etc.), and user types (crates user, GitHub user).

The DB class offers a set of methods for interacting with the database, including:

  • Inserting and selecting data for packages, versions, users, dependencies, and more
  • Caching mechanisms to improve performance
  • Batch processing capabilities for efficient data insertion
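To illustrate how batching and caching fit together, here is a rough sketch. It uses the standard-library `sqlite3` module purely as a stand-in for the real database layer, and `SketchDB`, `batched`, the `packages` table layout, and the batch size are all hypothetical.

```python
import sqlite3
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


class SketchDB:
    """Hypothetical stand-in showing batch inserts and an ID cache."""

    def __init__(self, conn, batch_size=2):
        self.conn = conn
        self.batch_size = batch_size
        self._id_cache = {}  # package name -> id, filled lazily
        conn.execute(
            "CREATE TABLE IF NOT EXISTS packages "
            "(id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
        )

    def insert_packages(self, names):
        # One executemany + commit per batch instead of one per row.
        for chunk in batched(names, self.batch_size):
            self.conn.executemany(
                "INSERT OR IGNORE INTO packages (name) VALUES (?)",
                [(name,) for name in chunk],
            )
            self.conn.commit()

    def package_id(self, name):
        # Serve repeated lookups from the cache to avoid extra queries.
        if name not in self._id_cache:
            row = self.conn.execute(
                "SELECT id FROM packages WHERE name = ?", (name,)
            ).fetchone()
            if row is None:
                return None
            self._id_cache[name] = row[0]
        return self._id_cache[name]
```

For example, `SketchDB(sqlite3.connect(":memory:"))` gives a throwaway database to try the methods against.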

The Fetcher class provides functionality for downloading and extracting data from package manager sources. It supports:

  • Downloading tarball files
  • Extracting contents to a specified directory
  • Maintaining a "latest" symlink so we always know where to look
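The three steps above can be sketched with the standard library alone. This is an illustrative outline, not the Fetcher's actual code: `download`, `extract_and_link`, the dated directory layout, and the `stamp` parameter are assumptions made for the example.

```python
import io
import tarfile
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def download(url):
    """Fetch the raw tarball bytes from a source URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def extract_and_link(tarball, root, stamp=None):
    """Extract tarball bytes into root/<stamp> and repoint root/latest."""
    root = Path(root)
    stamp = stamp or datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target = root / stamp
    target.mkdir(parents=True, exist_ok=True)

    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:*") as tar:
        # Use the safer "data" extraction filter where this Python supports it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(target, filter="data")
        else:
            tar.extractall(target)

    # Replace the "latest" symlink so it points at the newest extraction.
    latest = root / "latest"
    if latest.is_symlink() or latest.exists():
        latest.unlink()
    latest.symlink_to(stamp, target_is_directory=True)
    return target
```

Because the symlink is relative (`latest -> <stamp>`), the data directory can be moved as a whole without breaking it.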

The logger is a custom logging utility that provides consistent logging across all loaders.

The models module contains SQLAlchemy models representing the database schema, including:

  • Package, Version, User, License, DependsOn, and other relevant tables

Note

These models are also currently used to generate the migrations.

The Scheduler is a scheduling utility that allows loaders to run at specified intervals.
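Interval-based running can be pictured as a simple loop. The sketch below is a generic illustration rather than the Scheduler's real interface: `run_every` and its parameters are hypothetical, and the injectable `sleep` and `clock` arguments exist only to make the loop easy to test.

```python
import time


def run_every(job, interval_seconds, max_runs=None,
              sleep=time.sleep, clock=time.monotonic):
    """Call `job` repeatedly, waiting so runs start `interval_seconds` apart.

    Runs forever when `max_runs` is None; returns the number of runs otherwise.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = clock()
        job()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Subtract the time the job took so the interval does not drift.
        elapsed = clock() - started
        sleep(max(0.0, interval_seconds - elapsed))
    return runs
```

A loader could then be wrapped as `run_every(run_loader, 24 * 60 * 60)`, where `run_loader` is whatever function executes one full pipeline pass.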

The Transformer class provides a base for creating package manager-specific transformers. It includes:

  • Methods for locating and reading input files
  • Placeholder methods for transforming data into the required format

Usage

To create a new loader for a package manager:

  1. Create a new directory under package_managers/ for your package manager.
  2. Implement a fetcher that inherits from the base Fetcher and fetches the raw data from the package manager's source.
  3. Implement a custom Transformer class that inherits from the base Transformer and maps the raw data provided by the package manager into the data model described in the models module.
  4. Create a main script that utilizes the core components (Config, DB, Fetcher, Transformer, Scheduler) to fetch, transform, and load data.
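The overall shape of such a main script might look like the sketch below. It does not use the real core classes: `run_pipeline` and the `Stub*` classes are hypothetical stand-ins with invented method names, the reading of NO_CACHE as "clean up intermediate files" is an assumption, and scheduling is left out for brevity.

```python
from types import SimpleNamespace


class StubFetcher:
    """Stand-in fetcher that only records which steps were requested."""

    def __init__(self):
        self.calls = []

    def fetch(self):
        self.calls.append("fetch")

    def cleanup(self):
        self.calls.append("cleanup")


class StubTransformer:
    """Stand-in transformer yielding already-mapped package rows."""

    def packages(self):
        yield {"name": "serde"}
        yield {"name": "tokio"}


class StubDB:
    """Stand-in database that keeps inserted rows in memory."""

    def __init__(self):
        self.rows = []

    def insert_packages(self, rows):
        self.rows.extend(rows)


def run_pipeline(config, fetcher, transformer, db):
    """Fetch (if enabled), transform, then load; return the row count."""
    if config.fetch:
        fetcher.fetch()
    packages = list(transformer.packages())
    db.insert_packages(packages)
    if config.no_cache:
        # Assumed meaning: do not keep intermediate files around.
        fetcher.cleanup()
    return len(packages)


if __name__ == "__main__":
    config = SimpleNamespace(fetch=True, no_cache=False)
    count = run_pipeline(config, StubFetcher(), StubTransformer(), StubDB())
    print(f"loaded {count} packages")
```

Passing the components in as arguments keeps the fetch, transform, and load steps independently replaceable per package manager.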

Example usage can be found in the crates loader.