Extend the TesseractOCRParser with PDF output #271

Open
rsoika wants to merge 1 commit into apache:master from rsoika:master

@rsoika rsoika commented Apr 26, 2019

Currently the TesseractOCRParser supports two output formats: plain text and HOCR. The second was recently added by Eric Pugh (#133). My question is whether we should add a third output option, 'PDF', which Tesseract provides.

I am not sure whether it is enough to add the output type to the TesseractOCRConfig, as I did in this change.
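To make the open question concrete, here is a minimal stand-alone sketch of how an output type could map to the Tesseract command line. It is not the code from this pull request and does not use Tika classes: `TesseractCommandSketch`, `OutputType`, and `buildCommand` are hypothetical names chosen for illustration. It relies only on the Tesseract CLI convention that the output format is selected by a trailing config name (`txt`, `hocr`, or `pdf`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone sketch; it does not use or modify any Tika class.
class TesseractCommandSketch {

    // Assumed output types: TXT and HOCR are the two formats named above,
    // PDF is the proposed third option.
    enum OutputType { TXT, HOCR, PDF }

    // Builds a command of the form: tesseract <image> <outputBase> -l <lang> <config>
    // where <config> is the lower-cased output type (txt, hocr, or pdf).
    static List<String> buildCommand(String imagePath, String outputBase,
                                     String language, OutputType type) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        cmd.add(type.name().toLowerCase(Locale.ROOT));
        return cmd;
    }

    public static void main(String[] args) {
        // Only prints the command; nothing is executed.
        System.out.println(String.join(" ",
                buildCommand("page.png", "out", "eng", OutputType.PDF)));
    }
}
```

If the parser builds its command in a similar way, adding the enum value would be enough to make Tesseract write a PDF file. The part that probably needs more work is reading the result, since a PDF cannot be consumed like the plain-text or HOCR output.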

The other discussion point is whether this feature fits the focus of the Tika project. See the discussion here:
https://lists.apache.org/thread.html/d1c65367a8bfe13ebc977f6aff8abdfc3e9e09dbce429411dd554840@%3Cuser.tika.apache.org%3E

@changetoblow

Hi, I am using tika-app-1.20.jar to parse a PDF, but no content is extracted. The PDF contains only images, and the text is inside those images. I assumed Tika would call Tesseract OCR to do the recognition, but it does not seem to do so. Why?
