Cannot read .doc file #10
Comments
Hi,
I tried to write an extractor for it myself, but I was not able to extract the XML contents from the .doc file. Here is the code, if you want to play with it:
// source/parsers/docx.ts
// The text extractor for DOCX/DOC files.
import { type Buffer } from 'node:buffer'
import { extractRawText as parseWordFile } from 'mammoth'
import { unzip } from 'fflate'
import { parseStringPromise as xmlToJson } from 'xml2js'
import encoding from 'text-encoding'
import type { TextExtractionMethod } from '../lib.js'
export class DocExtractor implements TextExtractionMethod {
  /**
   * The type(s) of input acceptable to this method.
   */
  mimes = [
    'application/x-cfb',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  ]

  /**
   * Extract text from a DOCX/DOC file if possible.
   *
   * @param input The buffer containing the file.
   * @returns The text extracted from the input.
   */
  apply = async (input: Buffer): Promise<string> => {
    try {
      // Convert the DOCX to text and return the text.
      const parsedDocx = await parseWordFile({ buffer: input })
      return parsedDocx.value
    } catch (caughtError: unknown) {
      // If the file is a DOC file, then JSZip will fail to unzip it.
      const error = caughtError as Error
      if (error.message?.includes('Corrupted zip or bug')) {
        const contents = await unzipBuffer(input)
        const json = await xmlToJson(contents)
        const lines = await parseDocSection(json)
        // Fall back to an empty list so `undefined` is never stringified.
        return (lines ?? []).join('\n')
      } else {
        // If it is not a DOC file, let the error propagate.
        throw caughtError
      }
    }
  }
}
/**
 * Unzip a DOC file, and return the XML in it.
 *
 * @param input The buffer containing the file.
 *
 * @returns The XML, decoded as a UTF-8 string.
 */
const unzipBuffer = async (input: Buffer): Promise<string> => {
  // A Node.js buffer can be a view into a larger shared `ArrayBuffer`, so
  // respect its offset and length when creating the uint-8 array.
  const zipBuffer = new Uint8Array(
    input.buffer,
    input.byteOffset,
    input.byteLength,
  )
  const doc = await new Promise<Record<string, Uint8Array>>(
    (resolve, reject) => {
      unzip(zipBuffer, (error, result) => {
        if (error) reject(error)
        else resolve(result)
      })
    },
  )
  const file = doc['word/document.xml']
  if (!file) throw new Error('Invalid .doc file, could not find document.xml.')
  // Decode the raw bytes so that the XML parser receives a string.
  return new TextDecoder('utf-8').decode(file)
}
/**
 * Extracts text from a section of the document, recursively.
 *
 * @param docSection The section of the doc, converted to JSON from XML.
 * @param collectedText The lines of text parsed from the document so far.
 *
 * @returns The lines of text in the document.
 */
const parseDocSection = async (
  docSection: any,
  collectedText?: string[],
): Promise<string[] | undefined> => {
  // Keep track of the text being collected.
  const beingCollectedText = collectedText ?? []
  // Parse the section according to what type it is.
  if (Array.isArray(docSection)) {
    // If it is an array, loop through its elements.
    for (const element of docSection) {
      // Collect all the pieces of text from the array.
      if (typeof element === 'string' && element !== '') {
        beingCollectedText.push(element)
      } else {
        // However, if it is an object or another array, call this function
        // again to parse that.
        await parseDocSection(element, beingCollectedText)
      }
    }
    // Finally, return the collected text.
    return beingCollectedText
  }
  // If the section is a (non-null) object, loop through its properties.
  if (typeof docSection === 'object' && docSection !== null) {
    for (const property of Object.keys(docSection)) {
      // Get the value of the property.
      const value = docSection[property]
      // The `docx` format stores the actual text inside the `w:t` or `_`
      // properties, so extract text from those properties.
      // Check if it is a string or array that contains a string. If it is
      // either, then collect the text content.
      if (typeof value === 'string') {
        if ((property === 'w:t' || property === '_') && value !== '') {
          beingCollectedText.push(value)
        }
      } else if (Array.isArray(value) && typeof value[0] === 'string') {
        if ((property === 'w:t' || property === '_') && value[0] !== '') {
          beingCollectedText.push(value[0])
        }
      } else {
        // However, if it is an object or another array, call this function
        // again to parse that.
        await parseDocSection(value, beingCollectedText)
      }
    }
    // Finally, return the collected text.
    return beingCollectedText
  }
}
The unzip library,
If you can fix it or work around it in any way, please do let me know!
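A side note on handing a Node.js Buffer to a library that expects a Uint8Array, as unzipBuffer does: a buffer can be a view into a larger, shared ArrayBuffer, so wrapping only its `.buffer` property may cover more bytes than the file itself. The following is a minimal sketch of that pitfall, using a plain Uint8Array view to stand in for a pooled Buffer; `toExactUint8Array` is a hypothetical helper name, not part of any library mentioned here.

```typescript
// Hypothetical helper: wrap exactly the bytes a view covers, without copying.
const toExactUint8Array = (view: Uint8Array): Uint8Array =>
  new Uint8Array(view.buffer, view.byteOffset, view.byteLength)

// Simulate a pooled allocation: a 4-byte view inside a 16-byte ArrayBuffer.
const pool = new Uint8Array(16).fill(0xff)
const view = pool.subarray(4, 8)
view.set([0x50, 0x4b, 0x03, 0x04])

// Wrapping only `.buffer` ignores the view's offset and length.
const naive = new Uint8Array(view.buffer)
// Passing the offset and length keeps the same 4 bytes.
const exact = toExactUint8Array(view)

console.log(naive.length, exact.length) // prints "16 4"
```

Because Buffer is a subclass of Uint8Array, passing the buffer itself to the unzip function should also work and avoids the issue entirely.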
@abedshaaban Possible solutions for the error:
Hope this helps, thanks.
@gamemaker1 The error you're encountering with the unzip function is likely because DOC files are not simple zip archives like DOCX files. DOC files use a different container, the Compound File Binary Format (CFBF, also known as OLE2), which requires a different approach to extract its contents.
Sample snippet that handles both DOCX and DOC files correctly. This script uses the cfb library to handle DOC files and continues to use mammoth for DOCX files.
Hope this helps
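Since DOCX is a zip archive and DOC is a CFBF container, the two can be told apart from their leading bytes instead of waiting for the unzip step to fail. Below is a minimal sketch of that check, assuming the input is available as a Uint8Array (a Node.js Buffer qualifies). The names `detectWordFormat`, `startsWith`, and `WordFormat` are hypothetical and not part of mammoth, cfb, or this library; it only classifies the container and does not extract any text.

```typescript
// Leading bytes that identify each container format.
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04] // "PK\x03\x04", used by DOCX
const CFB_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] // used by legacy DOC

type WordFormat = 'docx' | 'doc' | 'unknown'

// Check whether `data` begins with every byte in `signature`.
const startsWith = (data: Uint8Array, signature: number[]): boolean =>
  data.length >= signature.length &&
  signature.every((byte, index) => data[index] === byte)

// Hypothetical helper: classify a file by its container signature.
// A zip signature only means "possibly DOCX" and a CFB signature only means
// "possibly DOC", because other formats (e.g. XLSX, XLS) share these containers.
const detectWordFormat = (data: Uint8Array): WordFormat => {
  if (startsWith(data, ZIP_SIGNATURE)) return 'docx'
  if (startsWith(data, CFB_SIGNATURE)) return 'doc'
  return 'unknown'
}
```

An extractor could call `detectWordFormat(input)` first and route 'docx' to mammoth and 'doc' to a CFBF-aware parser, rather than matching on the text of an error message.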
Description
An error occurred when reading a .doc file.
I looked into the code: application/x-cfb is not included in the MimeType type declaration, nor is doc in FileExtension.
Library version
3.0.2
Node version
20.9.0
Typescript version (if you are using it)
No response