tessedit_char_whitelist . detect only predefined chars . #78

MyraBaba · 2017-10-02T12:58:35Z

Hi,

We are using pyocr to read labels that contain only letters and digits.

How can I restrict detection to a specific list of characters?

I tried the following.

In libtesseract/__init__.py:

if "label" in builder.tesseract_configs:
            tesseract_raw.set_is_label(handle, True)

and in tesseract_raw.py:

def set_is_label(handle, mode):
    global g_libtesseract
    assert(g_libtesseract)

    if mode:
        # wl = b"0123456789ABCDEFGHIJKLMNOPRSTUVYZXW"
        wl = b"0123456789ABNOPRSTUVYZXW"

    else:
        wl = b""

    g_libtesseract.TessBaseAPISetVariable(
        ctypes.c_void_p(handle),
        b"tessedit_char_whitelist",
        wl
    )

But it did not work.

Is there a simpler way to do it, like:

tool.image_to_string(
    Image.open("tmp.png"),
    lang="eng",
    tessedit_char_whitelist="0123456789ABNOPRSTUVYZXW",
    builder=pyocr.builders.LineBoxBuilder()
)

thanks

jflesch · 2017-10-02T13:17:59Z

From what I can tell, your patch should give you the expected behavior. The only question is: where exactly did you put your call to set_is_label() in libtesseract/__init__.py?
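
One way to narrow this down on the libtesseract side is to check what TessBaseAPISetVariable() returns, since the patch above discards it. The sketch below is not part of pyocr: set_variable_checked is a hypothetical helper, and lib stands for the already loaded library object (g_libtesseract in tesseract_raw.py). It assumes the C API returns a false value when the variable name lookup fails.

```python
import ctypes


def set_variable_checked(lib, handle, name, value):
    """Set a Tesseract variable and raise if the library reports failure.

    lib    -- loaded libtesseract object (hypothetical parameter)
    handle -- integer handle of the TessBaseAPI instance
    name   -- variable name as bytes, e.g. b"tessedit_char_whitelist"
    value  -- variable value as bytes
    """
    ok = lib.TessBaseAPISetVariable(
        ctypes.c_void_p(handle),
        name,
        value,
    )
    if not ok:
        # A false return value means the call did not take effect
        raise ValueError("Tesseract rejected variable {!r}".format(name))
    return True
```

If this raises, the whitelist never reached Tesseract. If it passes and the output is still unfiltered, the cause is more likely elsewhere.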

MyraBaba · 2017-10-02T15:43:18Z

I put it in the image_to_string() definition:

def image_to_string(image, lang=None, builder=None):
    if builder is None:
        builder = builders.TextBuilder()
    handle = tesseract_raw.init(lang=lang)

    lvl_line = tesseract_raw.PageIteratorLevel.TEXTLINE
    lvl_word = tesseract_raw.PageIteratorLevel.WORD

    try:
        # XXX(Jflesch): Issue #51:
        # Tesseract TessBaseAPIRecognize() may segfault when the target
        # language is not available
        clang = lang if lang else "eng"
        if clang not in tesseract_raw.get_available_languages(handle):
            raise TesseractError(
                "no lang",
                "language {} is not available".format(clang)
            )

        tesseract_raw.set_page_seg_mode(
            handle, builder.tesseract_layout
        )

        tesseract_raw.set_image(handle, image)
        if "digits" in builder.tesseract_configs:
            tesseract_raw.set_is_numeric(handle, True)

        ## LABEL SPECIFIC ###

        if "label" in builder.tesseract_configs:
            tesseract_raw.set_is_label(handle, True)


        # XXX(JFlesch): PageIterator and ResultIterator are actually the
        # very same thing. If it changes, we are screwed.
        tesseract_raw.recognize(handle)
        res_iterator = tesseract_raw.get_iterator(handle)
        if res_iterator is None:
            raise TesseractError(
                "no script", "no script detected"
            )
        page_iterator = tesseract_raw.result_iterator_get_page_iterator(
            res_iterator
        )

        while True:
            if tesseract_raw.page_iterator_is_at_beginning_of(
                    page_iterator, lvl_line):
                (r, box) = tesseract_raw.page_iterator_bounding_box(
                    page_iterator, lvl_line
                )
                assert(r)
                box = _tess_box_to_pyocr_box(box)
                builder.start_line(box)

            last_word_in_line = tesseract_raw.page_iterator_is_at_final_element(
                page_iterator, lvl_line, lvl_word
            )

            word = tesseract_raw.result_iterator_get_utf8_text(
                res_iterator, lvl_word
            )

            if word is not None and word != "":
                (r, box) = tesseract_raw.page_iterator_bounding_box(
                    page_iterator, lvl_word
                )
                assert(r)
                box = _tess_box_to_pyocr_box(box)
                builder.add_word(word, box)

                if last_word_in_line:
                    builder.end_line()

            if not tesseract_raw.page_iterator_next(page_iterator, lvl_word):
                break

    finally:
        tesseract_raw.cleanup(handle)

    return builder.get_output()

MyraBaba · 2017-10-02T15:44:47Z

By the way, is there an easy way to do this without a patch?

jflesch · 2017-10-02T15:47:52Z

Currently, no. It's something that should be handled with a custom Builder class, but the hooks such a builder would need don't exist yet.

jflesch · 2017-10-02T15:49:07Z

I guess your modifications should work. I'll try to have a look at home when possible (I'm at work currently), but my life is a little bit complicated right now, so it may take a while, sorry.

MyraBaba · 2017-10-02T15:52:26Z

Thanks a lot.

I wish you a bright path on your life's journey, my friend.

Thx

jflesch · 2017-10-02T15:58:55Z

Actually, if you're ok with using Tesseract (through fork()+exec()) instead of libtesseract, you can use a custom builder.

Something along those lines should work:

import pyocr
import pyocr.builders
import pyocr.tesseract
from PIL import Image


class MyBuilder(pyocr.builders.TextBuilder):
    def __init__(self):
        # Run the base initializer first so the default configs are set up
        super().__init__()
        self.tesseract_configs += ["-c", "tessedit_char_whitelist=0123456789ABNOPRSTUVYZXW"]


builder = MyBuilder()
txt = pyocr.tesseract.image_to_string(
    Image.open('test.png'),
    builder=builder
)

jflesch · 2017-10-02T15:59:37Z

(I haven't tested though)

MyraBaba · 2017-10-02T16:36:08Z

For testing purposes: in tesseract.py, line 265, there is:

command += configs

While debugging, I see the following command and config variables:

command: <class 'list'>: ['tesseract', 'input.bmp', 'output', '-l', 'eng', '-psm', '1', 'hocr', '-c tessedit_char_whitelist=01239ABN']

config:<class 'list'>: ['hocr', '-c tessedit_char_whitelist=01239ABN']

Even with this, the result is still NOT whitelisted. I am confused at this point...

jflesch · 2017-10-02T16:42:06Z

If I remember correctly, the argument order does matter to Tesseract, so if you're using the LineBoxBuilder as base, I would actually suggest the following builder:

class MyBuilder(pyocr.builders.LineBoxBuilder):
    def __init__(self):
        super().__init__()
        self.tesseract_configs = ["-c", "tessedit_char_whitelist=0123456789ABNOPRSTUVYZXW"] + self.tesseract_configs

The idea with this change is to have the arguments "-c" "tessedit_char_whitelist..." before the file config argument hocr.
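
A minimal sketch of that ordering, independent of pyocr: with_config_vars is a hypothetical helper that places each "-c" and its key=value string, as two separate list items, in front of whatever configs the builder already has.

```python
def with_config_vars(base_configs, variables):
    """Return a new config list with '-c', 'key=value' pairs placed first."""
    prefix = []
    for key, value in variables.items():
        # "-c" and "key=value" are kept as two separate list items
        prefix += ["-c", "{}={}".format(key, value)]
    return prefix + list(base_configs)


configs = with_config_vars(
    ["hocr"],
    {"tessedit_char_whitelist": "0123456789ABNOPRSTUVYZXW"},
)
print(configs)
# ['-c', 'tessedit_char_whitelist=0123456789ABNOPRSTUVYZXW', 'hocr']
```

Building the list this way leaves the original configs untouched and keeps the config file name (hocr) last.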

jflesch · 2017-10-02T16:43:35Z

Also, did you make sure to explicitly use pyocr.tesseract instead of the first tool/module returned by pyocr.get_available_tools()?

MyraBaba · 2017-10-02T16:48:36Z

Finally 👍

ATTENTION: the argument order DOES MATTER to Tesseract.

Thanks, everything is working fine now.

['tesseract', 'input.bmp', 'output', '-l', 'eng', '-psm', '1', '-c', 'tessedit_char_whitelist=39BN', 'hocr', '-c tessedit_char_whitelist=01239ABN']

My Best...

mit456 · 2018-03-20T10:31:06Z

@MyraBaba @jflesch I am also trying to build a custom LineBoxBuilder. For now I am applying tessedit_char_blacklist=K as a test, but I also need to apply other config parameters such as tessedit_enable_dict_correction, language_model_ngram_order, etc. It seems the configurations are not being applied.
This is the code I am using:

class TesseractCustomBuilder(pyocr.builders.LineBoxBuilder):
    def __init__(self):
        super().__init__()
        self.tesseract_configs = ['-c tessedit_char_blacklist=K'] + self.tesseract_configs

builder = TesseractCustomBuilder()
boxes = pyocr.tesseract.image_to_string(Image.fromarray(image),
                                        builder=builder)

This is what the print at line 277 of tesseract.py shows: ['-c tessedit_char_blacklist=K', 'hocr'], but K is still being detected.

Please take a look and tell me if I am making a mistake.
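
One detail that may be worth checking here: in the printed list, '-c tessedit_char_blacklist=K' is a single item, whereas the builder that worked earlier in this thread passes "-c" and the key=value string as two separate items. Assuming pyocr hands the list to the subprocess as-is, each item becomes exactly one command-line argument, so Tesseract would not see a bare -c flag. A minimal sketch of a normalizer (split_config_tokens is a hypothetical helper, not a pyocr function):

```python
def split_config_tokens(configs):
    """Split any single '-c key=value' item into two separate items."""
    fixed = []
    for item in configs:
        if item.startswith("-c "):
            # Drop the leading "-c " and keep the key=value part on its own
            fixed += ["-c", item[3:].strip()]
        else:
            fixed.append(item)
    return fixed


print(split_config_tokens(["-c tessedit_char_blacklist=K", "hocr"]))
# ['-c', 'tessedit_char_blacklist=K', 'hocr']
```

Whether this alone makes the blacklist take effect depends on the Tesseract version and engine in use, so it is only a first thing to rule out.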

jflesch · 2018-03-20T10:47:36Z

@mit456 Have you tried from the command line directly ? (to make sure that Tesseract actually takes into account the option you specified)
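
A rough sketch of such a direct check, bypassing pyocr entirely. The function names and the file name label.png are placeholders, and it assumes a tesseract executable on PATH that accepts "stdout" as the output base, which may not hold for very old versions.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, lang="eng", variables=None):
    """Build the argument list: image, output base, then the options."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    for key, value in (variables or {}).items():
        # "-c" and "key=value" must be separate arguments
        cmd += ["-c", "{}={}".format(key, value)]
    return cmd


def run_tesseract(image_path, lang="eng", variables=None):
    """Run Tesseract directly and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang, variables),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout


print(build_tesseract_cmd("label.png", variables={"tessedit_char_blacklist": "K"}))
```

Comparing the output of run_tesseract() with and without the variable shows whether Tesseract itself honors the option, before pyocr is involved at all.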

mit456 · 2018-03-20T11:15:48Z

@jflesch I found the parameter in the output of tesseract --print-parameters, but when I pass it on the command line it does not work either, so I don't think it's a pyocr problem. Next time I will try the CLI first.
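
For reference, a small sketch of checking whether a parameter name is listed at all. It assumes each parameter line in the tesseract --print-parameters output starts with the name followed by a tab, which may differ between versions; parameter_names and list_tesseract_parameters are hypothetical helpers. A listed name only shows that Tesseract knows the parameter, not that it changes the result.

```python
import subprocess


def parameter_names(listing):
    """Extract parameter names from --print-parameters style text."""
    names = set()
    for line in listing.splitlines():
        if "\t" not in line:
            continue  # skip the header line and blank lines
        name = line.split("\t", 1)[0].strip()
        if name:
            names.add(name)
    return names


def list_tesseract_parameters():
    """Ask the installed tesseract binary for its parameter names."""
    result = subprocess.run(
        ["tesseract", "--print-parameters"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parameter_names(result.stdout)
```

Usage would be along the lines of "tessedit_char_blacklist" in list_tesseract_parameters().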

jflesch added the support label Oct 2, 2017

jflesch added feature request and removed support labels Oct 2, 2017
