Out of memory when running OCR for a lot of images #836

Open
tinganle opened this issue Jan 8, 2020 · 6 comments


@tinganle

tinganle commented Jan 8, 2020

java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (665M) > maxPhysicalBytes (600M)
at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:589)
at org.bytedeco.javacpp.Pointer.init(Pointer.java:125)
at org.bytedeco.tesseract.TessBaseAPI.allocate(Native Method)
at org.bytedeco.tesseract.TessBaseAPI.<init>(TessBaseAPI.java:35)

Even if I increase the heap size, it eventually runs into this error. We follow the basic example: we create a new instance of BytedecoOcrAPI and call init() for each 'document', which consists of multiple image files, and then call doOcr() for each image file.

public class BytedecoOcrAPI implements OcrAPI {

private TessBaseAPI api;
private String dataPath;

public BytedecoOcrAPI(String dataPath)  {
    this.api = new TessBaseAPI();
    this.dataPath = dataPath;
}

public void init() throws OcrException {
    if (api.Init(dataPath, "eng", 0) != 0) {
        throw new OcrException("failed to read tesseract data file from " + dataPath);
    }
    api.SetVariable("tessedit_char_blacklist", "fiflffffifflſt");
    api.SetVariable("hocr_font_info", "0");
    api.SetPageSegMode(1);
}

public String doOcr(File file) throws OcrException {
    return doOcr(file.getPath());
}

public String doOcr(String filePathName) throws OcrException {
    PIX image = pixRead(filePathName);
    api.SetImage(image);

    BytePointer output = api.GetHOCRText(0);

    try {
        return output.getString("UTF-8");
    } catch (IOException e) {
        throw new OcrException(e);
    } finally {
        output.deallocate();
        pixDestroy(image);
    }
}

public void close() {
    api.End();
}

}

@saudet
Member

saudet commented Jan 9, 2020

600 MB isn't a lot of memory. You'll probably need to increase that.

@saudet
Member

saudet commented Jan 9, 2020

Just to be sure, add a call to api.deallocate() right after api.End(). Let me know if that doesn't fix anything though.

@tinganle
Author

tinganle commented Jan 9, 2020

> 600 MB isn't a lot of memory. You'll probably need to increase that.

Yes. The 600 MB limit was only because I capped the heap at 300M on my local machine to reproduce the problem faster. When we ran the same process on Linux with more memory, it threw the same error at around 8G.

Also, when we monitored the Java heap size and the JVM process memory size, there was a huge difference. On my local machine, with the max heap set to 300M, the Java heap stayed below 250M, but the JVM process could use 2G of memory. We do parallel OCR processing, and each thread has only one image file being OCR-ed at any given time. If the memory were cleaned up properly, the memory usage shouldn't keep growing.
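A gap like this is what one would expect if most of the allocation happens in native code: the Java heap only accounts for Java objects, while buffers allocated by Tesseract and Leptonica live outside it. A minimal sketch for logging the heap side of that comparison with the standard library alone is shown here; the names `HeapReport` and `heapSummary` are made up for illustration. The native side has to come from the OS (for example the process RSS) or, presumably, from JavaCPP's `Pointer.physicalBytes()`, which appears to be the value named in the error message.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapReport {

    // Returns a one-line summary of the Java heap only. Native allocations
    // made by Tesseract/Leptonica are NOT included in these numbers.
    static String heapSummary() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long usedMb = heap.getUsed() / (1024 * 1024);
        long committedMb = heap.getCommitted() / (1024 * 1024);
        // getMax() returns -1 when the maximum heap size is undefined.
        String max = heap.getMax() < 0
                ? "undefined"
                : (heap.getMax() / (1024 * 1024)) + "M";
        return "heap used=" + usedMb + "M committed=" + committedMb + "M max=" + max;
    }

    public static void main(String[] args) {
        // Hypothetical usage: call this periodically from the OCR loop and
        // compare it with the process size reported by the OS.
        System.out.println(heapSummary());
    }
}
```

If the heap figure stays flat while the process size keeps climbing, the growth is most likely in native memory rather than in Java objects.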

I will try api.deallocate(). Any other ideas are much appreciated. Thanks!

@tinganle
Author

Hi @saudet ,

When I debug, both output and image have a null deallocator at the following statements. Calling output.deallocate() doesn't seem to do anything. Is this the desired behavior?

    output.deallocate();
    pixDestroy(image);

Thanks.

@saudet
Member

saudet commented Jan 10, 2020

Yes, those are just pointers returned from native functions, so JavaCPP doesn't know how to deallocate them.

@tinganle
Author

Update: calling TessDeleteText(output) after each OCR greatly helped the memory issue (not fully resolved yet).
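A minimal sketch of the pattern behind that fix, written with stand-in types so it runs without the native libraries: `NativeText` and the `release` callback are hypothetical placeholders for the `BytePointer` returned by `GetHOCRText` and for `TessDeleteText`. The only point is that the text is copied into a Java `String` first and the native buffer is then released in `finally`, on both the success and the failure path.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class NativeTextDemo {

    // Hypothetical stand-in for a natively allocated buffer (e.g. the
    // BytePointer returned by GetHOCRText), which has no deallocator attached.
    static final class NativeText {
        private final String content;
        NativeText(String content) { this.content = content; }
        String getString() { return content; }
    }

    // Copies the text into a Java String, then always hands the buffer to
    // `release` (the stand-in for TessDeleteText), even if the copy throws.
    static String readAndRelease(Supplier<NativeText> producer,
                                 Consumer<NativeText> release) {
        NativeText output = producer.get();
        try {
            return output.getString();
        } finally {
            release.accept(output);
        }
    }

    public static void main(String[] args) {
        AtomicInteger released = new AtomicInteger();
        String text = readAndRelease(
                () -> new NativeText("<div class='ocr_page'/>"),
                t -> released.incrementAndGet());
        System.out.println(text + " released=" + released.get());
    }
}
```

Applied to the doOcr method from the original post, this would presumably mean calling TessDeleteText(output) in the existing finally block in place of output.deallocate().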
