This issue is for a: (mark with an x)
- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
scripts/prepdocs.ps1
Any log messages given by the failure
INFO Extracting text from 'C:\Users\USER\Desktop\mco-vision/data\Document\Sample.docx' using Azure Document Intelligence (pdfparser.py:66)
Traceback (most recent call last):
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocs.py", line 439, in <module>
loop.run_until_complete(main(ingestion_strategy, setup_index=not args.remove and not args.removeall))
File "C:\Users\USER\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line 650, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocs.py", line 244, in main
await strategy.run()
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocslib\filestrategy.py", line 101, in run
sections = await parse_file(file, self.file_processors, self.category, self.image_embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocslib\filestrategy.py", line 29, in parse_file
pages = [page async for page in processor.parser.parse(content=file.content)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocslib\filestrategy.py", line 29, in <listcomp>
pages = [page async for page in processor.parser.parse(content=file.content)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocslib\pdfparser.py", line 80, in parse
poller = await document_intelligence_client.begin_analyze_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\.venv\Lib\site-packages\azure\core\tracing\decorator_async.py", line 94, in wrapper_use_tracer
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\.venv\Lib\site-packages\azure\ai\documentintelligence\aio\_operations\_patch.py", line 529, in begin_analyze_document
raw_result = await self._analyze_document_initial(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\.venv\Lib\site-packages\azure\ai\documentintelligence\aio\_operations\_operations.py", line 158, in _analyze_document_initial
raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: (InvalidArgument) Invalid argument.
Code: InvalidArgument
Message: Invalid argument.
Inner error: {
"code": "InvalidParameter",
"message": "The parameter ocrHighResolution for file type Docx is invalid: The feature is invalid or not supported."
}
Expected/desired behavior
OS and Version?
Windows 11
azd version?
azd version 1.11.0 (commit 5b92e0687e1fa96dfc8292f4b900c0c58610b6a5)
Versions
Mention any other details that might be useful
Please let me know how this should be resolved. In previous releases, .docx files were ingested without problems.
With Content Understanding enabled, however, ingestion fails with the ocrHighResolution error shown above.
This is a critical issue for us; your help is appreciated.
mrisahoo1 changed the title from "Error in ingestion of .docx file type due ocrHighResolution" to "Error in ingestion of .docx file type due ocrHighResolution when using Content Understanding." on Dec 18, 2024
Thanks for filing! I've just discussed this with the Content Understanding / Doc Intelligence team. They do not support figure extraction for docx, pptx, and xlsx. They recommend exporting those to PDF instead, if you think they have figures in them. We'll add code to this repo that warns you when trying to use the media description feature with those formats.
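Until such a warning lands in the repo, a guard along these lines could be added locally. This is only a rough sketch of the idea described in the reply above, not the repo's actual fix: the function names, the logger name, and the `media_description_enabled` flag are all hypothetical, and the only string taken from the issue itself is the `ocrHighResolution` parameter named in the error message. It assumes the high-resolution OCR feature is only requested when media description is turned on.

```python
import logging
from pathlib import PurePath

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("prepdocs_sketch")

# Formats for which, per the reply above, figure extraction is not supported.
UNSUPPORTED_FIGURE_FORMATS = {".docx", ".pptx", ".xlsx"}


def supports_figure_extraction(filename: str) -> bool:
    """Return True unless the file extension is one of the unsupported formats."""
    return PurePath(filename).suffix.lower() not in UNSUPPORTED_FIGURE_FORMATS


def select_analysis_features(filename: str, media_description_enabled: bool) -> list[str]:
    """Decide which optional analysis features to request for a file.

    Hypothetical helper: returns an empty list (plain text extraction) when
    media description is off or the format does not support figure extraction.
    """
    if not media_description_enabled:
        return []
    if not supports_figure_extraction(filename):
        logger.warning(
            "Media description is not supported for '%s'; extracting text only. "
            "Export the file to PDF if it contains figures.",
            filename,
        )
        return []
    # Feature name taken from the error message in this issue.
    return ["ocrHighResolution"]
```

With a guard like this, a .docx file would fall back to text-only extraction with a warning instead of failing the whole ingestion run, while PDFs would still request the high-resolution feature.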