Docx-Parser

This repository is a Python implementation of a document parser. Given a docx file, it produces a CSV file containing all the parsed data, with HTML tags and formatting preserved.

  • Note - The script is intended solely for extracting exam-related data. It preserves only the hierarchical structure of table data, images, and equation-related special characters.

Requirements: pypandoc, mammoth, BeautifulSoup, re, os, PIL, csv
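
The repository's own script is not reproduced here, so the following is only a minimal sketch of how a docx to HTML to CSV flow could look with the packages listed above. The function names (`docx_to_html`, `html_to_rows`, `rows_to_csv`) and the two-column CSV layout are assumptions made for illustration. The block splitting uses a simple `re` pattern rather than BeautifulSoup to keep the sketch short; it does not handle nested tables.

```python
import csv
import io
import re


def docx_to_html(path):
    """Convert a .docx file to an HTML string using mammoth."""
    import mammoth  # third-party; imported lazily so the rest runs without it

    with open(path, "rb") as docx_file:
        return mammoth.convert_to_html(docx_file).value


# Top-level blocks we want to keep as single CSV cells. A table is matched
# as one unit, so paragraphs inside its cells stay within that row.
# Assumption: tables are not nested (the non-greedy match would cut them short).
BLOCK_RE = re.compile(
    r"<(p|table|h[1-6]|ul|ol)\b[^>]*>.*?</\1>",
    re.DOTALL | re.IGNORECASE,
)


def html_to_rows(html):
    """Return (tag, raw_html) pairs, keeping inner tags untouched."""
    return [(m.group(1).lower(), m.group(0)) for m in BLOCK_RE.finditer(html)]


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["tag", "html"])
    writer.writerows(rows)
    return buffer.getvalue()
```

A possible usage, assuming a file named `exam.docx` exists: `print(rows_to_csv(html_to_rows(docx_to_html("exam.docx"))))`. Keeping each block's raw HTML in its own cell is what lets formatting such as bold text or table structure survive in the CSV.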