PerformanceCalculator fails to initialize on docker container #435
The YAML file looks deformed here, but I checked it with yamllint; it's correct.
Will take a look at it next week!
@nielsn Have you had a chance to look at it? I attempted to use several previous versions of the image, but I encountered the same errors. I also tried building the image from the repo, but I ran into dependency issues. I am eager to test this setup because I believe it is easier and more robust than the SM model monitor. However, I will need to move on if I can't get the nannyml container to work.
Could you try to get me the properly formatted YAML file? It has something to do with the data hierarchy in the configuration file. Sorry about not having a proper example in the docs.
Looks like it's still being deformed when I paste it here.
It was also complaining that the target dataset was needed, even though the docs said it was optional.
Hey @j0bekt01, I've done a bit of digging into this. The main idea is that a column mapping is no longer supported; instead, any calculator can be passed along with the parameters required for its construction. For example, after replacing the dataset paths with our example ones:

```yaml
input:
  reference_data:
    path: https://github.com/NannyML/nannyml/raw/refs/heads/main/nannyml/datasets/data/mc_reference.csv
    credentials: {}
    read_args: {}
  analysis_data:
    path: https://github.com/NannyML/nannyml/raw/refs/heads/main/nannyml/datasets/data/mc_analysis.csv
    credentials: {}
    read_args: {}
  target_data:
    path: https://github.com/NannyML/nannyml/raw/refs/heads/main/nannyml/datasets/data/mc_analysis_gt.csv
    credentials: {}
    read_args: {}
    join_column: id
calculators:
  - type: performance
    outputs:
      - type: raw_files
        params:
          path: output/
          write_args:
            filename: performance_metrics.csv
          format: csv
    store:
      path: output/store
      filename: performance_metrics.pkl
    params:
      metrics:
        - roc_auc
        - accuracy
        - f1
        - precision
        - recall
      y_true: y_true
      y_pred: y_pred
      y_pred_proba:
        - prepaid_card: y_pred_proba_prepaid_card
        - highstreet_card: y_pred_proba_highstreet_card
        - upmarket_card: y_pred_proba_upmarket_card
      problem_type: classification_multiclass
      timestamp_column_name: timestamp
      chunk_period: "D"
    problem_type: classification_multiclass
ignore_errors: true
```

As for your question about the target values: in the case of realized performance, the target data is actually required. Without it you can't calculate the realized performance. If you don't have that data available, you should estimate the performance instead, using e.g. CBPE. Hope this helps; don't hesitate to ask if you need additional help getting this working.
Thanks! I appreciate it. I should have some time to test it out today and will let you know how it goes.
@nnansters I'm curious about your thoughts on the setup I'm planning to implement and would appreciate your advice on optimizing it. My goal is to monitor models built in SageMaker, with artifacts stored in S3. Here's my envisioned workflow:
The requirements are that it must run on EC2, and I have to use AWS Prometheus and Grafana. I believe my requirements are fairly flexible, but I want to ensure the setup is optimal. What do you think? |
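For the Prometheus leg of that setup, one dependency-free option is to convert the runner's raw metrics output into Prometheus' text exposition format and let node_exporter's textfile collector pick it up. This is a sketch, not NannyML functionality: the two-column `metric,value` CSV layout and the metric prefix are assumptions you would adapt to the actual output files.

```python
import csv
import io


def csv_to_prometheus(csv_text: str, metric_prefix: str = "nannyml") -> str:
    """Convert a metrics CSV (assumed columns: metric, value) into
    Prometheus text exposition format, one sample per line."""
    lines = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Prometheus metric names allow [a-zA-Z0-9_:]; prefix keeps them namespaced.
        name = f"{metric_prefix}_{row['metric']}"
        lines.append(f"{name} {float(row['value'])}")
    return "\n".join(lines) + "\n"


# Usage with a made-up two-row CSV:
sample = "metric,value\nroc_auc,0.91\nf1,0.84\n"
print(csv_to_prometheus(sample))
```

Writing the result to a `.prom` file in the textfile collector's directory avoids running a long-lived exporter process on the EC2 box.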
Is there a complete sample of a nann.yml file that can be used with a Docker container? I have been trying all day to get a simple multi-class classification to work with the container but keep getting errors indicating that the calculator is missing arguments, even though I have added them to the YAML file.
```yaml
---
input:
  reference_data:
    path: s3://working/user-data/nannyml/reference_df.parquet
    credentials: {}
    read_args: {}
  analysis_data:
    path: s3://working/user-data/nannyml/analysis_df.parquet
    credentials: {}
    read_args: {}
  target_data:
    path: s3://working/user-data/nannyml/target_df.parquet
    credentials: {}
    read_args: {}
    join_column: id
output:
  raw_files:
    path: s3://working/user-data/nannyml/output/
    format: parquet
column_mapping:
  reference:
    features:
      - acq_channel
      - app_behavioral_score
      - requested_credit_limit
      - app_channel
      - credit_bureau_score
      - stated_income
      - is_customer
    timestamp: timestamp
    y_true: y_true
    y_pred: y_pred
    y_pred_proba:
      - y_pred_proba_prepaid_card
      - y_pred_proba_highstreet_card
      - y_pred_proba_upmarket_card
  analysis:
    features:
      - acq_channel
      - app_behavioral_score
      - requested_credit_limit
      - app_channel
      - credit_bureau_score
      - stated_income
      - is_customer
    timestamp: timestamp
    y_pred: y_pred
    y_pred_proba:
      - y_pred_proba_prepaid_card
      - y_pred_proba_highstreet_card
      - y_pred_proba_upmarket_card
calculators:
  metrics:
    y_true: y_true
    y_pred: y_pred
    problem_type: classification_multiclass
    timestamp: timestamp
    outputs: []
    store:
      path: s3://working/user-data/nannyml/nannyml/store/
      credentials: {}
      filename: performance_metrics.parquet
    params: {}
problem_type: classification_multiclass
ignore_errors: True
```
The sample datasets are generated using `nml.load_synthetic_multiclass_classification_dataset()`. Here is my docker command:

```shell
docker run -v ./nannyml/config/:/config/ nannyml/nannyml:0.11.0 nml run
```
```
loading configuration file from /config/nann.yml                                      cli.py:35
no scheduler configured, performing one-off run                                       run.py:43
read 840000 rows from s3://working/user-data/nannyml/reference_df.parquet             runner.py:109
read 780000 rows from s3://working/user-data/nannyml/nannyml/analysis_df.parquet      runner.py:109
read 120000 rows from s3://working/user-data/nannyml/nannyml/target_df.parquet        runner.py:109
[1/3] 'performance': loading calculator from store                                    runner.py:109
an unexpected exception occurred running 'performance':                               runner.py:109
PerformanceCalculator.__init__() missing 3 required positional arguments: 'metrics', 'y_true', and 'problem_type'
```
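For what it's worth, the last line of the log is a plain Python `TypeError` raised at construction time: the runner never picks up the required arguments from the old `column_mapping`-style layout, so it effectively instantiates the calculator with nothing. A minimal stand-in class (names hypothetical, not NannyML's actual implementation) reproduces the same shape of message:

```python
class Calculator:
    # Stand-in for a calculator whose constructor requires three arguments,
    # mirroring what PerformanceCalculator reports in the log above.
    def __init__(self, metrics, y_true, problem_type):
        self.metrics = metrics
        self.y_true = y_true
        self.problem_type = problem_type


try:
    Calculator()  # what effectively happens when no params are passed through
except TypeError as exc:
    # message ends with: missing 3 required positional arguments:
    # 'metrics', 'y_true', and 'problem_type'
    print(exc)
```

So the fix is not adding more keys to the old schema; the arguments have to reach the calculator's constructor via the `params` block of a `calculators` list entry.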