Out of memory when running OCR for a lot of images #836

tinganle · 2020-01-08T17:36:20Z

java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (665M) > maxPhysicalBytes (600M)
    at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:589)
    at org.bytedeco.javacpp.Pointer.init(Pointer.java:125)
    at org.bytedeco.tesseract.TessBaseAPI.allocate(Native Method)
    at org.bytedeco.tesseract.TessBaseAPI.<init>(TessBaseAPI.java:35)

Increasing the heap size only delays this error; it still occurs eventually. We follow the basic example: for each 'document', which consists of multiple image files, we create a new instance of BytedecoOcrAPI, call init() once, and then call doOcr() for each image file.

public class BytedecoOcrAPI implements OcrAPI {

    private TessBaseAPI api;
    private String dataPath;

    public BytedecoOcrAPI(String dataPath) {
        this.api = new TessBaseAPI();
        this.dataPath = dataPath;
    }

    public void init() throws OcrException {
        if (api.Init(dataPath, "eng", 0) != 0) {
            throw new OcrException("failed to read tesseract data file from ");
        }
        api.SetVariable("tessedit_char_blacklist", "ﬁﬂﬀﬃﬄﬅ");
        api.SetVariable("hocr_font_info", "0");
        api.SetPageSegMode(1);
    }

    public String doOcr(File file) throws OcrException {
        return doOcr(file.getPath());
    }

    public String doOcr(String filePathName) throws OcrException {
        PIX image = pixRead(filePathName);
        api.SetImage(image);

        BytePointer output = api.GetHOCRText(0);

        try {
            return output.getString("UTF-8");
        } catch (IOException e) {
            throw new OcrException(e);
        } finally {
            output.deallocate();
            pixDestroy(image);
        }
    }

    public void close() {
        api.End();
    }

}


saudet · 2020-01-09T03:00:41Z

600 MB isn't a lot of memory. You'll probably need to increase that.

saudet · 2020-01-09T03:02:19Z

Just to be sure, add a call to api.deallocate() right after api.End(). Let me know if that doesn't fix anything though.

tinganle · 2020-01-09T13:54:47Z

> 600 MB isn't a lot of memory. You'll probably need to increase that.

Yes. The 600 MB limit came from capping the heap at 300 MB on my local machine to reproduce the problem faster. When we ran the same process on Linux with more memory, it threw the same error at around 8 GB.

Also, when we monitored the Java heap size and the JVM process memory size, there was a huge difference. On my local machine, with the maximum heap size set to 300 MB, the Java heap stayed below 250 MB, but the JVM process used up to 2 GB of memory. We run OCR in parallel, and each thread processes only one image file at a time. If memory were released properly, usage should not keep growing.
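For anyone who wants to reproduce this comparison, here is a minimal sketch that reports Java heap usage next to the resident memory of the whole process. It is only an illustration: the class and method names (MemoryProbe, heapUsedBytes, residentBytes, report) are hypothetical, and it assumes Linux, where the resident set size can be read from the VmRSS line of /proc/self/status. On other platforms it reports "n/a". JavaCPP's own counters, such as the physicalBytes value shown in the error message, are deliberately left out so the sketch needs only the JDK.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for comparing Java heap usage with process memory. */
final class MemoryProbe {

    /** Bytes currently used on the Java heap (the part that -Xmx limits). */
    static long heapUsedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Parses a "VmRSS:   1234 kB" line; returns -1 for any other line. */
    static long parseVmRssKb(String line) {
        if (!line.startsWith("VmRSS:")) {
            return -1;
        }
        String[] parts = line.substring("VmRSS:".length()).trim().split("\\s+");
        return Long.parseLong(parts[0]);
    }

    /** Resident set size of this process in bytes, or -1 if /proc is unavailable. */
    static long residentBytes() {
        Path status = Paths.get("/proc/self/status");
        if (!Files.isReadable(status)) {
            return -1;
        }
        try {
            for (String line : Files.readAllLines(status)) {
                long kb = parseVmRssKb(line);
                if (kb >= 0) {
                    return kb * 1024;
                }
            }
        } catch (IOException e) {
            // Treat an unreadable status file the same as a missing one.
        }
        return -1;
    }

    /** One log line with both numbers, suitable for printing after each image. */
    static String report() {
        long rss = residentBytes();
        return String.format("heap used: %d MB, process resident: %s",
                heapUsedBytes() / (1024 * 1024),
                rss < 0 ? "n/a" : (rss / (1024 * 1024)) + " MB");
    }
}
```

Printing MemoryProbe.report() after every doOcr() call should show whether the resident figure keeps climbing while the heap figure stays flat, which would point at native allocations rather than Java objects.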

I will try api.deallocate(). Any other ideas are much appreciated. Thanks!

tinganle · 2020-01-10T17:47:46Z

Hi @saudet,

When I debug, both output and image have a null deallocator at the following statements. Calling output.deallocate() doesn't seem to do anything. Is this the desired behavior?

    output.deallocate();
    pixDestroy(image);

Thanks.

saudet · 2020-01-10T23:33:52Z

Yes, those are just pointers returned from native functions, so JavaCPP doesn't know how to deallocate them.

tinganle · 2020-01-28T17:40:56Z

Update: calling TessDeleteText(output) after each OCR call greatly reduced the memory growth, although the issue is not fully resolved yet.
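Since pointers returned from native functions carry no deallocator, the release has to go through the library's own free function, and it has to run even when reading the result fails. One way to make that hard to forget is a small wrapper like the sketch below. NativeResult, Body, and use are hypothetical names, not part of JavaCPP or Tesseract, and the sketch uses only the JDK so it can be tried without the native libraries.

```java
import java.util.function.Consumer;

/** Hypothetical helper, not part of JavaCPP or Tesseract. */
final class NativeResult {

    /** Like java.util.function.Function, but allowed to throw checked exceptions. */
    interface Body<R, T> {
        T apply(R resource) throws Exception;
    }

    /**
     * Runs body on a natively allocated value, then always passes that value
     * to the library's own free function, even if body throws.
     */
    static <R, T> T use(R resource, Consumer<R> free, Body<R, T> body) throws Exception {
        try {
            return body.apply(resource);
        } finally {
            free.accept(resource);
        }
    }
}
```

In doOcr() from the original post, the call might then look like NativeResult.use(api.GetHOCRText(0), out -> TessDeleteText(out), out -> out.getString("UTF-8")), with pixDestroy(image) kept in its own finally block. That wiring is an assumption based on the calls named in this thread and has not been tested against the Tesseract presets.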

saudet added the "help wanted" and "question" labels on Jan 29, 2020
