
Re: pdfs/ OCR question: msg#00089

science.linguistics.corpora

Subject: Re: pdfs/ OCR question

There are several issues with extracting text from PDF files:

- Scanned and OCRed documents, as has been mentioned, often have the scanned images of the original plus a text 'layer' to be used for copying-and-pasting. Not all documents have this text layer, however.

- In some senses, it can be said that PDF 'preserves the original text strings'. However, PDF wasn't designed for recovery of the original text; it was designed for faithful rendering on screen or on a printer. Frequently, spaces are missing from text in the PDF file -- for rendering, this doesn't matter, since the characters simply need to be drawn in the correct place. However, for text extraction, the presence of spaces often has to be inferred from the position of surrounding characters. Line breaks are never present, and again must be inferred from text placement. The sequence of text in the PDF document may not be the same sequence as in the original file, since sequence is irrelevant to rendering. And so on...

- Some PDF files use font subsets with custom encodings -- the file carries a table of codes and the glyph to render for each code; however, these codes aren't in ASCII or UTF-8 or anything recognisable. When you extract text from such a file, you generally get junk.
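
To make the second point concrete, here is a minimal sketch in Python of how spaces and line breaks might be inferred from glyph positions. It is not how PDFBox or Multivalent actually do it; the `Glyph` record, the `rebuild_text` function and the `gap_ratio` and `line_tol` thresholds are all hypothetical names and values chosen for illustration, and it assumes you already have each character's position and width from some PDF parser.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One drawn character (hypothetical record, not a real library type)."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; PDF y coordinates grow upward
    width: float  # advance width of the glyph


def rebuild_text(glyphs, gap_ratio=0.3, line_tol=2.0):
    """Rebuild text from positioned glyphs, inferring spaces and line breaks.

    gap_ratio and line_tol are assumed thresholds; real documents need tuning.
    """
    # Group glyphs into lines: walk from the top of the page downward and
    # start a new line whenever the baseline moves by more than line_tol.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(g.y - lines[-1][-1].y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        # Drawing order is not reading order, so sort left to right.
        line.sort(key=lambda g: g.x)
        pieces = [line[0].char]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A horizontal gap wider than a fraction of the previous
            # glyph's width is treated as a word space.
            if gap > gap_ratio * prev.width:
                pieces.append(" ")
            pieces.append(cur.char)
        out.append("".join(pieces))
    return "\n".join(out)
```

Even this toy version shows why extraction is fragile: the thresholds that work for one font size or justification style will split or merge words in another, and multi-column layouts need more than a single top-to-bottom pass.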

There are a few tools around for extracting text from PDF files -- PDFBox and Multivalent are two open source tools that I've used that perform pretty well.

Good luck!

Brett Powley




On 12/12/2006, at 2:31 PM, John F. Sowa wrote:

> interesting to know that pdf files store text info separately!

That depends on how the PDF was created:

Some PDF files are generated by scanning each page of a book or
article into an image format (GIF or TIFF, for example). In such
a PDF file, there are no character strings internally, and some
kind of OCR is necessary to convert the image into a character
string. The OCR process might convert an image for "the"
into the character string "die".

But if the PDF file had been generated from a text string in
any textual form, such as HTML, LaTeX, TXT, ODT, or DOC formats,
the internal PDF file preserves the original text strings. If
you copy and paste text from a PDF of that kind into an editor
for some other kind of text, such as OpenOffice or MS Word, you
will get a copy of the original character string, but some or
all of the formatting info may be lost. That process would
never convert "the" into "die".

There are some caveats, however. Some PDF files may have
special characters for ligatures, such as fi, fl, ff, etc.
Even though the ligatures are represented in character strings,
a copy & paste from such files to another editor may convert
the ligature to an unrecognized character. (Some OCR systems
also have difficulty with ligatures because the letters "f"
and "i" or "l" are too close together for easy recognition.)
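
As a rough illustration of the ligature caveat, the sketch below (in Python) expands the Latin ligature characters that Unicode defines back into plain letters. The `expand_ligatures` helper is a hypothetical name, and the approach only helps when the copied text contains the Unicode ligature code points; if the ligature came out as some other unrecognized character, there is nothing here to map from.

```python
# Unicode "Alphabetic Presentation Forms" for the common Latin f-ligatures.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.translate expects a table keyed by code point.
_TABLE = {ord(lig): plain for lig, plain in LIGATURES.items()}


def expand_ligatures(text: str) -> str:
    """Replace single-character ligatures with their component letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("\ufb01nd the \ufb02oor, o\ufb03ce"))
```

An explicit table is used here rather than a blanket Unicode compatibility normalization so that nothing other than the listed ligatures is altered.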

John Sowa




--------------------------------------------------------------
Brett Powley -- PhD Candidate
Centre for Language Technology, Macquarie University, Australia
w: http://www.ics.mq.edu.au/~bpowley
faciendi plures libros nullus est finis
frequensque meditatio carnis adflictio est
--------------------------------------------------------------







