Error in ingestion of .docx file type due ocrHighResolution when using Content Understanding. #2242

Open
mrisahoo1 opened this issue Dec 18, 2024 · 1 comment · May be fixed by #2260
Labels
bug Something isn't working

mrisahoo1 commented Dec 18, 2024

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

scripts/prepdocs.ps1

Any log messages given by the failure

INFO Extracting text from 'C:\Users\USER\Desktop\mco-vision/data\Document\Sample.docx' using Azure Document Intelligence (pdfparser.py:66)
Traceback (most recent call last):
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocs.py", line 439, in <module>
loop.run_until_complete(main(ingestion_strategy, setup_index=not args.remove and not args.removeall))
File "C:\Users\USER\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line 650, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocs.py", line 244, in main
await strategy.run()
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocslib\filestrategy.py", line 101, in run
sections = await parse_file(file, self.file_processors, self.category, self.image_embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocslib\filestrategy.py", line 29, in parse_file
pages = [page async for page in processor.parser.parse(content=file.content)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocslib\filestrategy.py", line 29, in <listcomp>
pages = [page async for page in processor.parser.parse(content=file.content)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\app\backend\prepdocslib\pdfparser.py", line 80, in parse
poller = await document_intelligence_client.begin_analyze_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\.venv\Lib\site-packages\azure\core\tracing\decorator_async.py", line 94, in wrapper_use_tracer
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\.venv\Lib\site-packages\azure\ai\documentintelligence\aio\_operations\_patch.py", line 529, in begin_analyze_document
raw_result = await self._analyze_document_initial(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Desktop\mco-vision\.venv\Lib\site-packages\azure\ai\documentintelligence\aio\_operations\_operations.py", line 158, in _analyze_document_initial
raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: (InvalidArgument) Invalid argument.
Code: InvalidArgument
Message: Invalid argument.
Inner error: {
"code": "InvalidParameter",
"message": "The parameter ocrHighResolution for file type Docx is invalid: The feature is invalid or not supported."
}

Expected/desired behavior

OS and Version?

Windows 11

azd version?

azd version 1.11.0 (commit 5b92e0687e1fa96dfc8292f4b900c0c58610b6a5)

Versions

Mention any other details that might be useful

Please let me know what the resolution for this should be. In previous releases, .docx files were ingested without problems.
With Content Understanding enabled, however, ingestion fails with the ocrHighResolution error shown above.
This is a critical issue for us, and your help is appreciated.


Thanks! We'll be in touch soon.

@mrisahoo1 mrisahoo1 changed the title Error in ingestion of .docx file type due ocrHighResolution Error in ingestion of .docx file type due ocrHighResolution when using Content Understanding. Dec 18, 2024
@pamelafox pamelafox self-assigned this Jan 8, 2025
@pamelafox pamelafox added the bug Something isn't working label Jan 8, 2025
@pamelafox
Collaborator

Thanks for filing! I've just discussed this with the Content Understanding / Doc Intelligence team. They do not support figure extraction for docx, pptx, and xlsx. They recommend exporting those to PDF instead, if you think they have figures in them. We'll add code to this repo that warns you when trying to use the media description feature with those formats.
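
In the meantime, here is a rough sketch of what such a guard could look like. It is only an illustration, not the code that will land in this repo: the function name `analysis_features`, the constant `FORMATS_WITHOUT_FIGURE_EXTRACTION`, and the logger name are hypothetical, and it assumes that `ocrHighResolution` (the parameter named in the error above) is the only extra feature requested when media description is enabled.

```python
import logging
import os

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("prepdocs_sketch")

# Formats for which figure extraction is not supported, per the comment above.
FORMATS_WITHOUT_FIGURE_EXTRACTION = {".docx", ".pptx", ".xlsx"}


def analysis_features(filename: str, use_media_description: bool) -> list[str]:
    """Return the optional analysis features to request for this file.

    Assumption: "ocrHighResolution" is the only extra feature requested
    when media description is turned on.
    """
    if not use_media_description:
        return []

    extension = os.path.splitext(filename)[1].lower()
    if extension in FORMATS_WITHOUT_FIGURE_EXTRACTION:
        # Warn and fall back to text-only parsing instead of failing the run.
        logger.warning(
            "Media description is not supported for %s files. "
            "Export '%s' to PDF if it contains figures.",
            extension,
            filename,
        )
        return []

    return ["ocrHighResolution"]
```

With a guard like this, `Sample.docx` would be parsed without the high-resolution feature and a warning would be logged, while PDFs would still get figure extraction. Converting the .docx to PDF before running `scripts/prepdocs.ps1` remains the recommended route if the figures matter.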
