PDF Editor for repairing book scan OCR? (lemmy.ml)

submitted 10 months ago* (last edited 10 months ago) by zabadoh@lemmy.ml to c/foss@beehaw.org


I have a book scan that came back from a book scanning company.

The scan images are fine, but the OCR text in the PDF is wacky, due to eccentric fonts, dirt, etc.

So I'm going to have to go through this by hand and tidy up.

I have tried a lot of FOSS PDF editors on this particular PDF, but none of them work as well as an old copy of Foxit PhantomPDF (an old version of the product currently named Foxit PDF Editor) that I have on a dying laptop.

I've tried the following commonly recommended FOSS PDF editors without much success:

LibreOffice Draw - Many text fields in wrong layer order. Page images not visible.

PDFEdit - Loads the file as blank

Scribus - Won't load the file

Firefox - Only allows annotation changes

Inkscape - It sort of works, but it's not oriented towards text editing, so viewing and editing text is cumbersome.


[-] sibloure@beehaw.org 6 points 10 months ago

I have had good results with Tesseract. I had to export the PDF to individual JPEGs, then batch OCR'd them with Tesseract, then merged the individual pages back into a single PDF. If you don't want to use the command line and are okay with it not being open source, PDF24.org does a good job and does not charge.
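The three steps above (export pages, batch OCR, merge) can be sketched roughly as follows. This is a minimal sketch, not the commenter's actual commands: it assumes the `pdftoppm` and `pdfunite` tools from poppler-utils and the `tesseract` CLI are installed, and the function names, the `page` prefix, and the 300 dpi default are illustrative choices.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf: str, prefix: str, dpi: int = 300) -> list[str]:
    # pdftoppm writes one JPEG per page, named <prefix>-<page number>.jpg
    return ["pdftoppm", "-r", str(dpi), "-jpeg", pdf, prefix]


def ocr_cmd(image: str, lang: str = "eng") -> list[str]:
    # The trailing "pdf" config makes Tesseract write <outputbase>.pdf,
    # i.e. the page image with a searchable text layer
    outputbase = str(Path(image).with_suffix(""))
    return ["tesseract", image, outputbase, "-l", lang, "pdf"]


def merge_cmd(pages: list[str], output: str) -> list[str]:
    # pdfunite takes the input PDFs in order, then the output file last
    return ["pdfunite", *pages, output]


def run_pipeline(pdf: str, workdir: str, output: str) -> None:
    # Assumed layout: all intermediate files live in workdir
    work = Path(workdir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, str(work / "page")), check=True)
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order
    images = sorted(work.glob("page-*.jpg"))
    for image in images:
        subprocess.run(ocr_cmd(str(image)), check=True)
    page_pdfs = [str(p.with_suffix(".pdf")) for p in images]
    subprocess.run(merge_cmd(page_pdfs, output), check=True)
```

Calling `run_pipeline("book_scan.pdf", "work", "book_ocr.pdf")` would run the whole chain; the file names here are placeholders.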

[-] Stubborn9867@lemmy.jnks.xyz 5 points 10 months ago* (last edited 10 months ago)

If you want to host it locally, Stirling PDF can be run in Docker, and uses a library that uses Tesseract. It has a bunch of other handy PDF operations, too. I keep it around for the two times a year I need to merge, split, or decrypt PDFs.

https://github.com/Frooodle/Stirling-PDF/blob/main/HowToUseOCR.md

It can do it straight from PDF and do multiple files at a time.

[-] sibloure@beehaw.org 1 points 10 months ago

This is amazing. Did not realize it existed. Thank you for sharing

[-] walthervonstolzing@lemmy.ml 4 points 10 months ago* (last edited 10 months ago)

Another vote for Tesseract -- just to clarify the terminology, though: PDF is a fragile format best used read-only, so you really don't want to edit a PDF; instead, make a new one using the same (or cleaned-up) bitmaps and a new OCR text layer.
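One possible way to do that "same bitmaps, new text layer" rebuild in a single step is the OCRmyPDF command-line tool, which drives Tesseract. This is a sketch under assumptions: OCRmyPDF is not mentioned in the comment above, it must be installed separately, and the helper name and file names are hypothetical.

```python
import subprocess


def redo_ocr_cmd(src: str, dst: str, lang: str = "eng") -> list[str]:
    # --redo-ocr asks OCRmyPDF to strip the existing OCR text layer
    # and write a fresh one into a new output file
    return ["ocrmypdf", "--redo-ocr", "-l", lang, src, dst]


def redo_ocr(src: str, dst: str, lang: str = "eng") -> None:
    # Writes a new PDF at dst; the source file is left untouched
    subprocess.run(redo_ocr_cmd(src, dst, lang), check=True)
```

For example, `redo_ocr("book_scan.pdf", "book_redone.pdf")` would produce a second file next to the original, so the scan company's PDF stays available for comparison.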

Now, Tesseract is excellent at recognizing glyphs, but when the scanned image is a little fuzzy, layout detection falters, and you get redundant line breaks & chunks of text in the wrong order -- all of which gets incredibly annoying for searching & copying. So if you can spare the time, and the text requires it, you may need to mark regions (mainly paragraphs & titles) on the bitmap image manually. A few frontends to Tesseract help with a task like that; check out, e.g., https://github.com/manisandro/gImageReader. Inside a single paragraph block, Tesseract doesn't get confused as easily, and the text output comes out in the correct reading order, w/o redundant breaks.
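The region-by-region idea can also be sketched without a GUI. This is a rough illustration, not how gImageReader works internally: it assumes each region has already been cropped to its own image file, the `Region` type and its fields are hypothetical, and sorting by top edge then left edge only gives the right reading order for a simple single-column page.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Region:
    image: str  # path to an image already cropped to one paragraph or title
    top: int    # pixel offset of the region's top edge on the page
    left: int   # pixel offset of the region's left edge on the page


def region_cmd(image: str, lang: str = "eng") -> list[str]:
    # --psm 6 tells Tesseract to assume a single uniform block of text,
    # so it skips full-page layout analysis; "stdout" prints the text
    return ["tesseract", image, "stdout", "--psm", "6", "-l", lang]


def reading_order(regions: list[Region]) -> list[Region]:
    # Simplification: top-to-bottom, then left-to-right (single column only)
    return sorted(regions, key=lambda r: (r.top, r.left))


def ocr_regions(regions: list[Region], lang: str = "eng") -> str:
    chunks = []
    for region in reading_order(regions):
        result = subprocess.run(
            region_cmd(region.image, lang),
            check=True, capture_output=True, text=True,
        )
        chunks.append(result.stdout.strip())
    # One blank line between regions, none inside them
    return "\n\n".join(chunks)
```

Because the order comes from the marked regions rather than from Tesseract's own layout detection, a fuzzy page affects glyph recognition only, not paragraph order.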

[-] algorithmae 1 points 10 months ago

Unfortunately, that's kind of the state of things. From what I know, editing PDFs requires expensive licensing from Adobe and that's why you have to pay for most editors.

Have you tried recreating the faulty page, then printing as PDF and splicing it in with a free web PDF recombiner like ilovepdf?

this post was submitted on 21 Dec 2023
