Commit
(fix) dockerfile path changed so it can be used
Signed-off-by: Kannav02 <[email protected]>
Showing 1 changed file with 7 additions and 7 deletions.
c4161ef
===================================
==> Dataset: EDA Corpus
==> Running tests for agent-retriever
/home/luarss/actions-runner/_work/ORAssistant/ORAssistant/evaluation/.venv/lib/python3.12/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.4.9, however version 2.0 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
warnings.warn(
Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 8.68it/s]
Evaluating: 100%|██████████| 100/100 [18:51<00:00, 11.32s/it]
✨ You're running DeepEval's latest Contextual Precision Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Contextual Recall Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Hallucination Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
Evaluating 100 test case(s) in parallel: |██████████|100% (100/100) [Time Taken: 00:47, 2.09test case/s]
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results on Confident AI.
‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's metrics directly on Confident AI instead.
Average Metric Scores:
Contextual Precision 0.7030119047619047
Contextual Recall 0.8576428571428572
Hallucination 0.5340588023088023
Metric Passrates:
Contextual Precision 0.65
Contextual Recall 0.84
Hallucination 0.57
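
For readers unfamiliar with the two summary blocks above, the following is a minimal sketch of how an average score and a pass rate could be derived from per-test-case scores. It is not taken from the ORAssistant evaluation code: the `summarize` helper, the 0.5 threshold, and the sample scores are all assumptions for illustration. It also assumes that for a metric such as Hallucination a lower score is better, so a case passes when its score is at or below the threshold.

```python
from statistics import mean


def summarize(scores, threshold=0.5, lower_is_better=False):
    """Return (average score, pass rate) for one metric.

    Hypothetical helper: the threshold and pass rule are assumptions,
    not values read from the evaluation run above.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if lower_is_better:
        # e.g. a hallucination-style metric: pass when the score is low enough
        passed = [s <= threshold for s in scores]
    else:
        # e.g. precision/recall-style metrics: pass when the score is high enough
        passed = [s >= threshold for s in scores]
    return mean(scores), sum(passed) / len(scores)


# Made-up per-test-case scores, purely illustrative.
sample_scores = [1.0, 0.5, 0.25, 0.75]
avg, passrate = summarize(sample_scores)
print(f"avg={avg:.3f} passrate={passrate:.2f}")
```

Under these assumptions, a pass rate of 0.65 over 100 test cases would mean that 65 cases met the threshold for that metric.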