tessedit_char_whitelist . detect only predefined chars . #78
Comments
From what I can tell, with your patch you should have the expected behavior. The only question is where did you put your call to |
I put it in image_to_string:

def image_to_string(image, lang=None, builder=None):
    if builder is None:
        builder = builders.TextBuilder()
    handle = tesseract_raw.init(lang=lang)

    lvl_line = tesseract_raw.PageIteratorLevel.TEXTLINE
    lvl_word = tesseract_raw.PageIteratorLevel.WORD

    try:
        # XXX(Jflesch): Issue #51:
        # Tesseract TessBaseAPIRecognize() may segfault when the target
        # language is not available
        clang = lang if lang else "eng"
        if clang not in tesseract_raw.get_available_languages(handle):
            raise TesseractError(
                "no lang",
                "language {} is not available".format(clang)
            )

        tesseract_raw.set_page_seg_mode(
            handle, builder.tesseract_layout
        )

        tesseract_raw.set_image(handle, image)
        if "digits" in builder.tesseract_configs:
            tesseract_raw.set_is_numeric(handle, True)
        ## LABEL SPECIFIC ###
        if "label" in builder.tesseract_configs:
            tesseract_raw.set_is_label(handle, True)

        # XXX(JFlesch): PageIterator and ResultIterator are actually the
        # very same thing. If it changes, we are screwed.
        tesseract_raw.recognize(handle)
        res_iterator = tesseract_raw.get_iterator(handle)
        if res_iterator is None:
            raise TesseractError(
                "no script", "no script detected"
            )
        page_iterator = tesseract_raw.result_iterator_get_page_iterator(
            res_iterator
        )

        while True:
            if tesseract_raw.page_iterator_is_at_beginning_of(
                    page_iterator, lvl_line):
                (r, box) = tesseract_raw.page_iterator_bounding_box(
                    page_iterator, lvl_line
                )
                assert(r)
                box = _tess_box_to_pyocr_box(box)
                builder.start_line(box)

            last_word_in_line = tesseract_raw.page_iterator_is_at_final_element(
                page_iterator, lvl_line, lvl_word
            )

            word = tesseract_raw.result_iterator_get_utf8_text(
                res_iterator, lvl_word
            )

            if word is not None and word != "":
                (r, box) = tesseract_raw.page_iterator_bounding_box(
                    page_iterator, lvl_word
                )
                assert(r)
                box = _tess_box_to_pyocr_box(box)
                builder.add_word(word, box)

                if last_word_in_line:
                    builder.end_line()

            if not tesseract_raw.page_iterator_next(page_iterator, lvl_word):
                break
    finally:
        tesseract_raw.cleanup(handle)

    return builder.get_output()
By the way, is there an easy way to do this without a patch?
Currently no. It's something that should be handled using a custom Builder class, but currently there aren't the required hooks for such a builder.
I guess your modifications should work. I'll try to have a look at home when possible (I'm at work currently), but my life is a little bit complicated at the moment, so it may take a while, sorry.
Thanks a lot. I wish you a decent path full of light for your life journey, my friend. Thx.
Actually, if you're OK with using the Tesseract executable (through fork()+exec()) instead of libtesseract, you can use a custom builder. Something along these lines should work:

import pyocr
import pyocr.builders
import pyocr.tesseract
from PIL import Image


class MyBuilder(pyocr.builders.TextBuilder):
    def __init__(self):
        # Let the parent builder set up tesseract_configs first
        super().__init__()
        self.tesseract_configs += ["-c", "tessedit_char_whitelist=0123456789ABNOPRSTUVYZXW"]


builder = MyBuilder()
txt = pyocr.tesseract.image_to_string(
    Image.open('test.png'),
    builder=builder
)
(I haven't tested though) |
For testing purposes:
I am debugging, and I have the command and config variables below:
In this debug run the result is still NOT whitelisted. I am confused at the moment...
If I remember correctly, the argument order does matter to Tesseract, so if you're using the LineBoxBuilder as base, I would actually suggest the following builder:

class MyBuilder(pyocr.builders.LineBoxBuilder):
    def __init__(self):
        super().__init__()
        self.tesseract_configs = ["-c", "tessedit_char_whitelist=0123456789ABNOPRSTUVYZXW"] + self.tesseract_configs

The idea with this change is to have the arguments "-c" "tessedit_char_whitelist..." before the file config argument.
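To illustrate the ordering point above without needing pyocr or Tesseract installed, here is a minimal sketch. The helpers `prepend_whitelist` and `build_tesseract_argv` are hypothetical names made up for this illustration (they are not part of pyocr), and the command layout is simplified; it only assumes that the builder's config list is appended to the command line as-is.

```python
def prepend_whitelist(configs, chars):
    """Hypothetical helper: put the '-c' option in front of the existing configs."""
    return ["-c", "tessedit_char_whitelist=" + chars] + list(configs)


def build_tesseract_argv(image, output, lang, configs):
    """Hypothetical helper: assemble a simplified tesseract command line."""
    return ["tesseract", image, output, "-l", lang] + list(configs)


# "hocr" stands in for the config file name that a box builder would add.
base_configs = ["hocr"]
configs = prepend_whitelist(base_configs, "0123456789ABNOPRSTUVYZXW")
argv = build_tesseract_argv("input.bmp", "output", "eng", configs)
print(argv)
```

With the list built this way, "-c" and its value come before the "hocr" config file name, which is the ordering the suggested builder aims for.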
Also, did you make sure to use explicitly |
Finally 👍
Thanks, now all is fine and working:

['tesseract', 'input.bmp', 'output', '-l', 'eng', '-psm', '1', '-c', 'tessedit_char_whitelist=39BN', 'hocr', '-c tessedit_char_whitelist=01239ABN']

My Best...
@MyraBaba @jflesch I am also trying to build a custom builder:

class TesseractCustomBuilder(pyocr.builders.LineBoxBuilder):
    def __init__(self):
        super().__init__()
        self.tesseract_configs = ['-c tessedit_char_blacklist=K'] + self.tesseract_configs

builder = TesseractCustomBuilder()
boxes = pyocr.tesseract.image_to_string(Image.fromarray(image),
                                        builder=builder)

This is the print I am getting at
Please have a look and tell me if I am making any mistake.
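One thing that may be worth checking here (an assumption, not something confirmed in this thread): the builder above passes '-c tessedit_char_blacklist=K' as a single list element, while the earlier suggestion passes "-c" and the key=value pair as two separate elements. When a command is run from an argument list, each element reaches the program as one argument, so the two forms are not equivalent. The sketch below uses the standard library's `shlex.split` to show the difference; `split_options` is a hypothetical helper name.

```python
import shlex


def split_options(options):
    """Hypothetical helper: split each option string into separate argv tokens."""
    tokens = []
    for option in options:
        tokens.extend(shlex.split(option))
    return tokens


single = ['-c tessedit_char_blacklist=K']
fixed = split_options(single)
print(len(single), single)  # one argument containing a space
print(len(fixed), fixed)    # two arguments: the flag and its value
```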
@mit456 Have you tried from the command line directly ? (to make sure that Tesseract actually takes into account the option you specified) |
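A minimal sketch of such a direct check, driven from Python so it can be rerun easily. It assumes the `tesseract` executable is on the PATH and that 'test.png' is a placeholder for your own image; `whitelist_check_command` and `run_check` are hypothetical helper names. Using `stdout` as the output base makes Tesseract print the recognized text instead of writing a file.

```python
import shutil
import subprocess


def whitelist_check_command(image_path, chars):
    """Hypothetical helper: argv for a direct Tesseract run with a whitelist."""
    return ["tesseract", image_path, "stdout",
            "-c", "tessedit_char_whitelist=" + chars]


def run_check(image_path, chars):
    """Run the command if tesseract is installed; return its text output or None."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        whitelist_check_command(image_path, chars),
        capture_output=True, text=True, check=False,
    )
    return result.stdout


cmd = whitelist_check_command("test.png", "0123456789")
print(" ".join(cmd))
```

If the text returned by `run_check` still contains characters outside the whitelist, the problem is probably on the Tesseract side rather than in the pyocr builder.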
@jflesch I found it from |
Hi,
We are using pyocr to detect labels which contain only alphanumeric characters and digits.
How can I apply a specific list of characters to be detected?
I tried:
in libtesseract/__init__.py
and in tesseract_raw.py:
But I couldn't get it to work.
Is there any way to do it in a simpler way, like:
Thanks