
PerformanceCalculator fails to initialize on docker container #435

Open
j0bekt01 opened this issue Jan 17, 2025 · 9 comments
Assignees
Labels: bug (Something isn't working), triage (Needs to be assessed)

Comments

@j0bekt01

Is there a complete sample of a nann.yml file that can be used with a Docker container? I have been trying all day to get a simple multi-class classification to work with the container but keep getting errors indicating that the calculator is missing arguments, even though I have added them to the YAML file.
---
input:
  reference_data:
    path: s3://working/user-data/nannyml/reference_df.parquet
    credentials: {}
    read_args: {}
  analysis_data:
    path: s3://working/user-data/nannyml/analysis_df.parquet
    credentials: {}
    read_args: {}
  target_data:
    path: s3://working/user-data/nannyml/target_df.parquet
    credentials: {}
    read_args: {}
    join_column: id

output:
  raw_files:
    path: s3://working/user-data/nannyml/output/
    format: parquet

column_mapping:
  reference:
    features:
      - acq_channel
      - app_behavioral_score
      - requested_credit_limit
      - app_channel
      - credit_bureau_score
      - stated_income
      - is_customer
    timestamp: timestamp
    y_true: y_true
    y_pred: y_pred
    y_pred_proba:
      - y_pred_proba_prepaid_card
      - y_pred_proba_highstreet_card
      - y_pred_proba_upmarket_card

  analysis:
    features:
      - acq_channel
      - app_behavioral_score
      - requested_credit_limit
      - app_channel
      - credit_bureau_score
      - stated_income
      - is_customer
    timestamp: timestamp
    y_pred: y_pred
    y_pred_proba:
      - y_pred_proba_prepaid_card
      - y_pred_proba_highstreet_card
      - y_pred_proba_upmarket_card

calculators:
  - type: performance
    metrics:
      - roc_auc
      - accuracy
      - f1
      - precision
      - recall
    y_true: y_true
    y_pred: y_pred
    problem_type: classification_multiclass
    timestamp: timestamp
    outputs: []
    store:
      path: s3://working/user-data/nannyml/nannyml/store/
      credentials: {}
      filename: performance_metrics.parquet
    params: {}

problem_type: classification_multiclass
ignore_errors: True

The sample datasets are generated using nml.load_synthetic_multiclass_classification_dataset().

Here is my docker command:

    docker run -v ./nannyml/config/:/config/ nannyml/nannyml:0.11.0 nml run

loading configuration file from /config/nann.yml                                   cli.py:35
no scheduler configured, performing one-off run                                    run.py:43
read 840000 rows from s3://working/user-data/nannyml/reference_df.parquet          runner.py:109
read 780000 rows from s3://working/user-data/nannyml/nannyml/analysis_df.parquet   runner.py:109
read 120000 rows from s3://working/user-data/nannyml/nannyml/target_df.parquet     runner.py:109
[1/3] 'performance': loading calculator from store                                 runner.py:109
an unexpected exception occurred running 'performance':                            runner.py:109
PerformanceCalculator.__init__() missing 3 required positional arguments: 'metrics', 'y_true', and 'problem_type'
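The final TypeError is ordinary Python constructor behavior: when required positional arguments never reach `__init__`, Python raises before the object is created. A minimal sketch with a hypothetical stand-in class (not NannyML's actual implementation) showing the error mechanism:

```python
# Hypothetical stand-in with the same required constructor arguments;
# not NannyML's actual code, just an illustration of the error mechanism.
class PerformanceCalculator:
    def __init__(self, metrics, y_true, problem_type):
        self.metrics = metrics
        self.y_true = y_true
        self.problem_type = problem_type

# Calling the constructor with no arguments reproduces the same kind
# of TypeError seen in the log above.
try:
    PerformanceCalculator()
except TypeError as exc:
    print(exc)
```

In other words, the arguments exist in the YAML file, but the runner never passes them to the constructor, which is why the hierarchy of the config file matters.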

@j0bekt01 j0bekt01 added the bug (Something isn't working) and triage (Needs to be assessed) labels on Jan 17, 2025
@j0bekt01
Author

The YAML file looks deformed here, but I checked it with yamllint and it's correct.

@nielsn

nielsn commented Jan 17, 2025

Will take a look at it next week!

@j0bekt01
Author

@nielsn Have you had a chance to look at it? I attempted to use several previous versions of the image, but I encountered the same errors. I also tried building the image from the repo, but I ran into dependency issues. I am eager to test this setup because I believe it is easier and more robust than the SM model monitor. However, I will need to move on if I can't get the NannyML container to work.

@nnansters
Contributor

Could you try to get me the properly formatted YAML file? It has something to do with the data hierarchy in the configuration file. Sorry about not having a proper example in the docs.

@j0bekt01
Author

j0bekt01 commented Jan 22, 2025

Looks like it's still being deformed when I paste it here, so I have uploaded it to my GitHub here: https://github.com/j0bekt01/misc/blob/main/nann.yml

@j0bekt01
Author

It was also complaining that the target dataset was needed, even though the docs said it was optional.

@nnansters
Contributor

Hey @j0bekt01 ,

I've done a bit of digging into the config.py file and was able to make the example work. The following config looks quite different from yours; I'm afraid our docs on this part are sorely neglected.

The main idea is that column mapping is no longer supported; instead, any calculator can be passed along with the parameters required for its construction.

For example, after replacing the dataset paths with our example ones:

input:
  reference_data:
    path: https://github.com/NannyML/nannyml/raw/refs/heads/main/nannyml/datasets/data/mc_reference.csv
    credentials: {}
    read_args: {}
  analysis_data:
    path: https://github.com/NannyML/nannyml/raw/refs/heads/main/nannyml/datasets/data/mc_analysis.csv
    credentials: {}
    read_args: {}
  target_data:
    path: https://github.com/NannyML/nannyml/raw/refs/heads/main/nannyml/datasets/data/mc_analysis_gt.csv
    credentials: {}
    read_args: {}
    join_column: id

calculators:
  - type: performance
    outputs:
      - type: raw_files
        params:
          path: output/
        write_args:
          filename: performance_metrics.csv
          format: csv
    store:
      path: output/store
      filename: performance_metrics.pkl
    params:
      metrics:
        - roc_auc
        - accuracy
        - f1
        - precision
        - recall
      y_true: y_true
      y_pred: y_pred
      y_pred_proba:
        - prepaid_card: y_pred_proba_prepaid_card
        - highstreet_card: y_pred_proba_highstreet_card
        - upmarket_card: y_pred_proba_upmarket_card
      problem_type: classification_multiclass
      timestamp_column_name: timestamp
      chunk_period: "D"

problem_type: classification_multiclass

ignore_errors: true
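One detail that differs from the original attempt: in the multiclass case, `y_pred_proba` is a mapping from class label to probability column, written in the YAML above as a list of single-key entries. A minimal plain-Python sketch of how that parsed structure collapses into one lookup table (the list is hand-written here, standing in for what a YAML loader would produce):

```python
# The y_pred_proba section above, as a YAML loader would parse it:
# a list of single-key mappings, one per class.
y_pred_proba_entries = [
    {"prepaid_card": "y_pred_proba_prepaid_card"},
    {"highstreet_card": "y_pred_proba_highstreet_card"},
    {"upmarket_card": "y_pred_proba_upmarket_card"},
]

# Collapse into a single class-label -> column-name mapping.
class_to_column = {k: v for entry in y_pred_proba_entries for k, v in entry.items()}
print(class_to_column["prepaid_card"])  # y_pred_proba_prepaid_card
```

This is why each entry names the class on the left and the probability column on the right, instead of the bare column list used in the old column_mapping style.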

As for your question about the target values: in the case of realized performance, the target data is actually required, since without it you can't calculate the realized performance. If you don't have that data available, you should estimate the performance instead using e.g. CBPE.
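To make that concrete: realized performance is scored directly against the known targets, so without `y_true` there is nothing to compare the predictions to. A minimal illustration of realized accuracy in plain Python (illustration only, not NannyML code):

```python
def realized_accuracy(y_true, y_pred):
    """Fraction of predictions matching the known targets."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Tiny made-up sample using the class labels from the example dataset.
y_true = ["prepaid_card", "upmarket_card", "highstreet_card", "prepaid_card"]
y_pred = ["prepaid_card", "upmarket_card", "prepaid_card", "prepaid_card"]
print(realized_accuracy(y_true, y_pred))  # 0.75
```

Estimators such as CBPE exist precisely for the case where the `y_true` column above is not yet available.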

Hope this helps, don't hesitate to ask if you need additional help to get this bit working.

@j0bekt01
Author

Thanks! I appreciate it. I should have some time to test it out today and will let you know how it goes.

@j0bekt01
Author

j0bekt01 commented Jan 23, 2025

@nnansters I'm curious about your thoughts on the setup I'm planning to implement and would appreciate your advice on optimizing it. My goal is to monitor models built in SageMaker, with artifacts stored in S3. Here's my envisioned workflow:

  1. Once models are deployed, inference data will be captured and aggregated daily to create an analysis dataset.
  2. These datasets will be passed to the NannyML container along with the model-specific configuration.
  3. The output will be pushed to Prometheus and subsequently made available in AWS Managed Prometheus to be consumed by AWS Managed Grafana.

The requirements are that it must run on EC2, and I have to use AWS Prometheus and Grafana. I believe my requirements are fairly flexible, but I want to ensure the setup is optimal. What do you think?
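On step 3: if the metrics are pushed to Prometheus via a Pushgateway, the payload is plain text in the Prometheus exposition format. A small sketch of rendering one sample line (the metric and label names are made up for illustration):

```python
def format_metric(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

# Hypothetical gauge for one value produced by a daily calculator run.
line = format_metric(
    "nannyml_realized_roc_auc",
    {"model": "credit_card", "chunk": "2025-01-22"},
    0.91,
)
print(line)  # nannyml_realized_roc_auc{chunk="2025-01-22",model="credit_card"} 0.91
```

A batch of such lines can then be POSTed to the Pushgateway, from which AWS Managed Prometheus can scrape them.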

(architecture diagram attached as an image)
