Extend the TesseractOCRParser with PDF output by rsoika · Pull Request #271 · apache/tika

rsoika · 2019-04-26T06:56:37Z

Currently the TesseractOCRParser supports two output formats: plain text and HOCR. The second was recently added by Eric Pugh (#133). My question is whether we should add a third output option, 'PDF', which Tesseract already provides.

I am not sure whether it is enough to add the output type as I did in the TesseractOCRConfig.
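To make the question concrete, here is a minimal sketch of how an output type could map onto the Tesseract command line. It assumes the standard CLI form `tesseract <image> <outputbase> [options] [configfile]`, where the trailing config name (`txt`, `hocr` or `pdf`) selects the renderer. The class name, the enum and the `buildCommand` helper are hypothetical and are not taken from the Tika code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TesseractCommandSketch {

    // Hypothetical enum for illustration only; not copied from TesseractOCRConfig.
    enum OutputType { TXT, HOCR, PDF }

    // Tesseract selects its renderer by a trailing config name: txt, hocr or pdf.
    static String configName(OutputType type) {
        return type.name().toLowerCase(Locale.ROOT);
    }

    // Builds the argument list for one OCR run (assumed helper, not a Tika API).
    static List<String> buildCommand(String tesseractPath, String inputImage,
                                     String outputBase, String language,
                                     OutputType type) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractPath);
        cmd.add(inputImage);
        cmd.add(outputBase);       // Tesseract appends the file extension itself
        cmd.add("-l");
        cmd.add(language);
        cmd.add(configName(type)); // e.g. "pdf" produces <outputBase>.pdf
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("tesseract", "scan.png", "scan", "eng", OutputType.PDF);
        System.out.println(String.join(" ", cmd));
    }
}
```

Building the command is the easy part. Unlike plain text and HOCR, the PDF result is a binary file, so any code that reads the Tesseract output back as character data would probably need a separate path for it.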

The other discussion point is whether this feature fits the focus of the Tika project. See the discussion here:
https://lists.apache.org/thread.html/d1c65367a8bfe13ebc977f6aff8abdfc3e9e09dbce429411dd554840@%3Cuser.tika.apache.org%3E

changetoblow · 2019-05-08T09:22:02Z

Hi, I am using tika-app-1.20.jar to parse a PDF, but no content is extracted. The PDF contains only images, and the text is inside those images. I assumed Tika would call Tesseract OCR to recognize the text, but it does not appear to do so. Why?

Commit 5b58869: Extend the TesseractOCRParser with PDF output

rsoika wants to merge 1 commit into apache:master from rsoika:master
