Ccozi/temp logging fix (#2249)
* Benchmarking january results. (#2189)

* Benchmarking january results.

* Update to add MFE job definition files.

* Fix phi-2 paths.

* Update phi-2 model directory.

* Fix boolq phi-2 results path.

---------

Co-authored-by: Alex Kalita <[email protected]>

* Model card updated for whisper large (#2202)

* fix credential-less blob check (#2188)

* fix credential-less blob check

* add spec_version_upgrader

* update component versions

* add header and doc string.

* add more UT for spec version upgrader

* remove trailing whitespace

* add missing param.

* add null check for client_secret for adlsgen2 datastore

---------

Co-authored-by: Richard Li <[email protected]>

* upgrading the environment to latest pkgs (#2204)

* removing NC series from computes allow list (#2211)

* updating model specific defaults and finetune config for mistral model (#2209)

* add rai qa quality and safety eval flow (#2208)

* add rai qa quality and safety eval flow

* add test_config for rai qa quality & safety flow

* Check if secrets exist (#2217)

* Check if secrets exist

* update

* Update

* add batch allowlist for mistral base model (#2201)

* add batch allowlist for mistral base model

* format

* Fix olive-optimizer vul Jan new (#2200)

* Vulnerability fixes for python-sdk-v2 and model-management environment (#2216)

* sdk v2

* sdk v2

* sdk v2

* sdk v2

* sdk v2

* sdk v2

* sdk v2

* sdk v2

* sdk v2

* new acpt env for torch2.1 and cuda12.1 (#2186)

* new env for cuda12.1

* updated

* update rai qa safety flow output format (#2226)

* update rai qa safety flow output format

* update rai qa quality & safety flow output format

* bump up component version and use azureml-rag 0.2.24.2 in environment (#2225)

* Update DBCopilot version (#2220)

* Preprocessor custom script fix (#2219)

* Replaced os.system with subprocess.check_output in dataset_preprocessor method that is used to run custom script.

* Replaced os.system with subprocess.check_output in dataset_preprocessor method that is used to run custom script.
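The switch described above replaces `os.system` (which discards the child process's output and returns only an exit status) with `subprocess.check_output`, so the custom script's stdout/stderr can be surfaced in the error message when it fails. A minimal sketch of the pattern; the helper name and `RuntimeError` wrapper are illustrative, not the repository's own:

```python
import subprocess
import sys


def run_custom_script(cmd: list) -> str:
    """Run a preprocessing command, capturing combined stdout/stderr.

    On a non-zero exit, CalledProcessError.output holds everything the
    script printed, which os.system would have thrown away.
    """
    try:
        return subprocess.check_output(
            cmd, stderr=subprocess.STDOUT, universal_newlines=True
        )
    except subprocess.CalledProcessError as e:
        raise RuntimeError(e.output.strip()) from e


print(run_custom_script([sys.executable, "-c", "print('preprocessed')"]).strip())
```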

* Fix llama-2-7b results for truthful-qa (#2229)

* stable diffusion XL base model support (#2233)

* basexl update

* wrapper updates

* format update

* Make sure we recover details

* Upgrade AML Benchmark components (#2236)

Co-authored-by: Sarthak Singhal <[email protected]>

* add gsq e2e test (#2231)

* Update inputs (#2239)

* Remove acs stuff in faiss pipeline (#2240)

* Add two promptflow models: count-cars and detect-defects (#2070)

* Add two promptflow models: count-cars and detect-defects

* Add ci test configs for count-cars and detect-defects

* Put "connection" into "inputs" for Azure OpenAI GPT-4 Turbo with Vision tool

---------

Co-authored-by: Zhi Zhou <[email protected]>

* Update DBCopilot promptflow (#2242)

* SystemLog: prefix logging

* Adding more detailed logging

* Ccozianu/rm bug fix (#2247)

* add more logs

* fix stdout logs

* Make sure we recover details

* Make sure we recover details (#2238)

* SystemLog: prefix logging

* Ccozi/temp logging fix (#2246)

* Make sure we recover details

* SystemLog: prefix logging

* Adding more detailed logging

---------

Co-authored-by: svaruag <[email protected]>

* Fixing typo

---------

Co-authored-by: arun-rajora <[email protected]>
Co-authored-by: Alex Kalita <[email protected]>
Co-authored-by: HrishikeshGeedMS <[email protected]>
Co-authored-by: Richard Li <[email protected]>
Co-authored-by: Richard Li <[email protected]>
Co-authored-by: pmanoj <[email protected]>
Co-authored-by: qusongms <[email protected]>
Co-authored-by: Ayush Mishra <[email protected]>
Co-authored-by: ym11369 <[email protected]>
Co-authored-by: savitamittal1 <[email protected]>
Co-authored-by: jingyizhu99 <[email protected]>
Co-authored-by: XiangRao <[email protected]>
Co-authored-by: Nivedita Mishra <[email protected]>
Co-authored-by: Ramu Vadthyavath <[email protected]>
Co-authored-by: sarthaks95 <[email protected]>
Co-authored-by: Sarthak Singhal <[email protected]>
Co-authored-by: Ilya Matiach <[email protected]>
Co-authored-by: jinzhaochang <[email protected]>
Co-authored-by: Zhi Zhou <[email protected]>
Co-authored-by: Zhi Zhou <[email protected]>
Co-authored-by: svaruag <[email protected]>
22 people authored Feb 2, 2024
1 parent 20ffbd7 commit 2d747f0
Showing 503 changed files with 34,373 additions and 702 deletions.
14 changes: 10 additions & 4 deletions .github/workflows/assets-validation.yaml
@@ -54,12 +54,18 @@ jobs:
         python-version: '>=3.8'
 
     - name: Log in to Azure
+      env:
+        # to use in condition
+        client_id: ${{ secrets.AZURE_CLIENT_ID }}
+        tenant_id: ${{ secrets.AZURE_TENANT_ID }}
+        subscription_id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
+      if: env.client_id != '' && env.tenant_id != ''
       uses: azure/login@v1
       with:
-        client-id: ${{ secrets.AZURE_CLIENT_ID }}
-        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
-        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
+        client-id: ${{ env.client_id }}
+        tenant-id: ${{ env.tenant_id }}
+        subscription-id: ${{ env.subscription_id }}
 
     - name: Install dependencies
       run: pip install -e $scripts_azureml_assets_dir
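The workflow hunk above copies the secrets into step-level environment variables and gates the Azure login step on them being non-empty, so the step is skipped on runs (such as forked-PR builds) that receive empty secrets. The guard itself is trivial and can be sketched with an illustrative helper:

```python
def should_attempt_login(client_id: str, tenant_id: str) -> bool:
    """Mirror the workflow guard: env.client_id != '' && env.tenant_id != ''."""
    return client_id != "" and tenant_id != ""


# Forked-PR runs typically receive empty secrets, so login is skipped there.
print(should_attempt_login("11111111-2222", "aaaa-bbbb"))  # True when both set
print(should_attempt_login("", "aaaa-bbbb"))               # False: no client id
```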
@@ -4,7 +4,7 @@ type: pipeline
 name: batch_benchmark_inference
 display_name: Batch Benchmark Inference
 description: Components for batch endpoint inference
-version: 0.0.4
+version: 0.0.5
 
 inputs:
   input_dataset:
@@ -149,7 +149,7 @@ jobs:
   # Preparer
   batch_inference_preparer:
     type: command
-    component: azureml:batch_inference_preparer:0.0.5
+    component: azureml:batch_inference_preparer:0.0.6
     inputs:
       input_dataset: ${{parent.inputs.input_dataset}}
       model_type: ${{parent.inputs.model_type}}
@@ -167,7 +167,7 @@
   # Inference
   endpoint_batch_score:
     type: parallel
-    component: azureml:batch_benchmark_score:0.0.4
+    component: azureml:batch_benchmark_score:0.0.5
     inputs:
       model_type: ${{parent.inputs.model_type}}
       online_endpoint_url: ${{parent.inputs.endpoint_url}}
@@ -199,7 +199,7 @@ jobs:
   # Reformat
   batch_output_formatter:
     type: command
-    component: azureml:batch_output_formatter:0.0.5
+    component: azureml:batch_output_formatter:0.0.6
     inputs:
       model_type: ${{parent.inputs.model_type}}
       batch_inference_output: ${{parent.jobs.endpoint_batch_score.outputs.mini_batch_results_out_directory}}
@@ -1,6 +1,6 @@
 $schema: http://azureml/sdk-2-0/ParallelComponent.json
 name: batch_benchmark_score
-version: 0.0.4
+version: 0.0.5
 display_name: Batch Benchmark Score
 is_deterministic: False
 type: parallel
@@ -4,7 +4,7 @@ type: command
 name: batch_inference_preparer
 display_name: Batch Inference Preparer
 description: Prepare the jsonl file and endpoint for batch inference component.
-version: 0.0.5
+version: 0.0.6
 
 inputs:
   input_dataset:
@@ -1,5 +1,5 @@
 name: batch_output_formatter
-version: 0.0.5
+version: 0.0.6
 display_name: Batch Output Formatter
 is_deterministic: True
 type: command
@@ -4,7 +4,7 @@ type: pipeline
 name: batch_benchmark_inference_claude
 display_name: Batch Benchmark Inference with claude support
 description: Components for batch endpoint inference
-version: 0.0.1
+version: 0.0.2
 
 inputs:
   input_dataset:
@@ -151,7 +151,7 @@ jobs:
   # Preparer
   batch_inference_preparer:
     type: command
-    component: azureml:batch_inference_preparer:0.0.4
+    component: azureml:batch_inference_preparer:0.0.6
     inputs:
       input_dataset: ${{parent.inputs.input_dataset}}
       model_type: ${{parent.inputs.model_type}}
@@ -168,7 +168,7 @@
   # Inference
   endpoint_batch_score:
     type: parallel
-    component: azureml:batch_benchmark_score:0.0.4
+    component: azureml:batch_benchmark_score:0.0.5
     inputs:
       model_type: ${{parent.inputs.model_type}}
       online_endpoint_url: ${{parent.inputs.endpoint_url}}
@@ -199,7 +199,7 @@ jobs:
   # Reformat
   batch_output_formatter:
     type: command
-    component: azureml:batch_output_formatter:0.0.4
+    component: azureml:batch_output_formatter:0.0.6
     inputs:
       model_type: ${{parent.inputs.model_type}}
       batch_inference_output: ${{parent.jobs.endpoint_batch_score.outputs.mini_batch_results_out_directory}}
@@ -4,7 +4,7 @@ type: command
 name: benchmark_result_aggregator
 display_name: Benchmark result aggregator
 description: Aggregate quality metrics, performance metrics and all of the metadata from the pipeline. Also add them to the root run.
-version: 0.0.3
+version: 0.0.4
 is_deterministic: false
 
 inputs:
@@ -4,7 +4,7 @@ type: command
 name: compute_performance_metrics
 display_name: Compute Performance Metrics
 description: Performs performance metric post processing using data from a model inference run.
-version: 0.0.1
+version: 0.0.2
 is_deterministic: true
 
 inputs:
@@ -4,7 +4,7 @@ type: command
 name: dataset_downloader
 display_name: Dataset Downloader
 description: Downloads the dataset onto blob store.
-version: 0.0.1
+version: 0.0.2
 
 inputs:
   dataset_name:
@@ -4,7 +4,7 @@ type: command
 name: dataset_preprocessor
 display_name: Dataset Preprocessor
 description: Dataset Preprocessor
-version: 0.0.1
+version: 0.0.2
 is_deterministic: true
 
 inputs:
2 changes: 1 addition & 1 deletion assets/aml-benchmark/components/dataset-sampler/spec.yaml
@@ -4,7 +4,7 @@ type: command
 name: dataset_sampler
 display_name: Dataset Sampler
 description: Samples a dataset containing JSONL file(s).
-version: 0.0.1
+version: 0.0.2
 
 inputs:
   dataset:
@@ -4,7 +4,7 @@ type: command
 name: inference_postprocessor
 display_name: Inference Postprocessor
 description: Inference Postprocessor
-version: 0.0.2
+version: 0.0.3
 is_deterministic: true
 
 inputs:
2 changes: 1 addition & 1 deletion assets/aml-benchmark/components/prompt_crafter/spec.yaml
@@ -6,7 +6,7 @@ display_name: Prompt Crafter
 description: This component is used to create prompts from a given dataset. From a
   given jinja prompt template, it will generate prompts. It can also create
   few-shot prompts given a few-shot dataset and the number of shots.
-version: 0.0.4
+version: 0.0.5
 is_deterministic: true
 
 inputs:
@@ -445,7 +445,7 @@ def main(
             AzureMLError.create(
                 BenchmarkUserError,
                 error_details=f"{retries_err_msg} Details: {BufferStore.get_all_data()}"
-            )
+            ))
     elif delete_managed_deployment:
         if not deployment_metadata:
             logger.info("Delete deployment using input parameters.")
@@ -6,9 +6,9 @@
 """DataPreprocessor class and runner."""
 
 import json
-import os
 import re
 import jinja2
+import subprocess
 
 from azureml._common._error_definition.azureml_error import AzureMLError
 from aml_benchmark.utils.exceptions import BenchmarkValidationException, BenchmarkUserException
@@ -144,13 +144,17 @@ def run(self) -> None:
         return
 
     def run_user_preprocessor(self) -> None:
-        """Prerpocessor run using custom template."""
+        """Preprocessor run using custom script."""
         try:
-            os.system(
-                f'python {self.user_preprocessor} --input_path {self.input_dataset} \
-                    --output_path {self.output_dataset}'
+            _ = subprocess.check_output(
+                f"python {self.user_preprocessor} --input_path {self.input_dataset} \
+                    --output_path {self.output_dataset}",
+                stderr=subprocess.STDOUT,
+                universal_newlines=True,
+                shell=True,
             )
-        except Exception as e:
+        except subprocess.CalledProcessError as e:
+            error_message = e.output.strip()
             raise BenchmarkUserException._with_error(
-                AzureMLError.create(BenchmarkUserError, error_details=e)
+                AzureMLError.create(BenchmarkUserError, error_details=error_message)
             )
@@ -142,6 +142,7 @@ def create_deployment(self):
         payload['properties']["versionUpgradeOption"] = "OnceNewDefaultVersionAvailable"
         payload['properties']["raiPolicyName"] = "Microsoft.Default"
         resp = self._call_endpoint(get_requests_session().put, self._aoai_deployment_url, payload=payload)
+        logger.info(f"Calling(PUT) {self._aoai_deployment_url} returned {resp.status_code} with content {resp.content}.")
         self._raise_if_not_success(resp)
         logger.info("Calling(PUT) {} returned {} with content {}.".format(
             self._aoai_deployment_url, resp.status_code, self._get_content_from_response(resp)))
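The logging change above records the deployment PUT response before the success check, so a failing call still leaves its status code and body in the logs (the pre-existing log line runs only after the check passes). The ordering can be sketched with stand-in names for the client and error type:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("deploy")


class DeploymentError(Exception):
    pass


def put_deployment(call, url, payload):
    """Log the response before checking it, so failures are never silent."""
    resp = call(url, payload)
    logger.info("Calling(PUT) %s returned %s with content %s.",
                url, resp["status"], resp["content"])
    if resp["status"] >= 400:  # stand-in for _raise_if_not_success
        raise DeploymentError(f"PUT {url} failed: {resp['status']}")
    return resp
```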
@@ -77,7 +77,7 @@ def model_version(self) -> str:
         finetuned_run = get_dependent_run(self.model_depend_step)
         ws = Run.get_context().experiment.workspace
         finetuned_run_id = self._get_model_registered_run_id(finetuned_run)
-        logger.info(f"Finetuned run id is {finetuned_run_id}")
+        logger.info(f"Searching for model in worskpace {ws} run_id={finetuned_run_id} is {self._model_name}")
         models = list(Model.list(ws, self._model_name, run_id=finetuned_run_id))
         if len(models) == 0:
             raise BenchmarkUserException._with_error(
3 changes: 2 additions & 1 deletion assets/common/environments/python-sdk-v2/context/Dockerfile
@@ -1,6 +1,6 @@
 FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:{{latest-image-tag}}
 
-RUN apt-get update -y && apt-get install libc-bin libc-dev-bin libc6 libc6-dev libcurl3-gnutls tar openssh-server openssh-client openssl curl -y
+RUN apt-get update -y && apt-get install binutils libssh-4 libsqlite3-0 libpam-modules linux-libc-dev libldap-common libldap-2.4-2 libc-bin libc-dev-bin libc6 libc6-dev libcurl3-gnutls libgnutls30 tar openssh-server openssh-client openssl curl -y
 
 WORKDIR /
 
@@ -17,3 +17,4 @@ RUN conda env create -p $CONDA_PREFIX -f conda_dependencies.yaml -q && \
     rm conda_dependencies.yaml && \
     conda run -p $CONDA_PREFIX pip cache purge && \
     conda clean -a -y
+
@@ -3,7 +3,7 @@ channels:
   - conda-forge
 dependencies:
   - python=3.8
-  - pip=21.2.4
+  - pip=23.3
   - pip:
     - azure-ai-ml=={{latest-pypi-version}}
     - azure-identity=={{latest-pypi-version}}
@@ -13,4 +13,5 @@ dependencies:
     - azureml-telemetry=={{latest-pypi-version}}
     - cryptography=={{latest-pypi-version}}
     - certifi=={{latest-pypi-version}}
-    - urllib3=={{latest-pypi-version}}
+    - urllib3=={{latest-pypi-version}}
+    - paramiko=={{latest-pypi-version}}
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_gpt_35_turbo_0301_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_gpt_35_turbo_0301_question_answering
 description: gpt-35-turbo-0301 run for boolq dataset
 dataset_family: boolq
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_gpt_35_turbo_0613_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_gpt_35_turbo_0613_question_answering
 description: gpt-35-turbo-0613 run for boolq dataset
 dataset_family: boolq
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_gpt_4_0314_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_gpt_4_0314_question_answering
 description: gpt-4-0314 run for boolq dataset
 dataset_family: boolq
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_gpt_4_0613_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_gpt_4_0613_question_answering
 description: gpt-4-0613 run for boolq dataset
 dataset_family: boolq
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_gpt_4_32k_0314_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_gpt_4_32k_0314_question_answering
 description: gpt-4-32k-0314 run for boolq dataset
 dataset_family: boolq
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_gpt_4_32k_0613_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_gpt_4_32k_0613_question_answering
 description: gpt-4-32k-0613 run for boolq dataset
 dataset_family: boolq
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_llama_2_13b_chat_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_llama_2_13b_chat_question_answering
 description: llama-2-13b-chat run for boolq dataset
 dataset_family: boolq
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_llama_2_13b_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_llama_2_13b_question_answering
 description: llama-2-13b run for boolq dataset
 dataset_family: boolq
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_llama_2_70b_chat_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_llama_2_70b_chat_question_answering
 description: llama-2-70b-chat run for boolq dataset
 dataset_family: boolq
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_llama_2_70b_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_llama_2_70b_question_answering
 description: llama-2-70b run for boolq dataset
 dataset_family: boolq
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_llama_2_7b_chat_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_llama_2_7b_chat_question_answering
 description: llama-2-7b-chat run for boolq dataset
 dataset_family: boolq
@@ -1,6 +1,6 @@
 type: evaluationresult
 name: boolq_llama_2_7b_question_answering
-version: 1.0.1
+version: 1.0.2
 display_name: boolq_llama_2_7b_question_answering
 description: llama-2-7b run for boolq dataset
 dataset_family: boolq
@@ -0,0 +1,3 @@
+type: evaluationresult
+spec: spec.yaml
+categories: ["EvaluationResult"]
