Re: pdfs/ OCR question: msg#00088
science.linguistics.corpora
> interesting to know that pdf files store text info separately!

That depends on how the PDF was created:

Some PDF files are generated by scanning each page of a book or article into an image format (GIF or TIFF, for example). Such a PDF file contains no character strings internally, so some kind of OCR is necessary to convert the images into character strings. The OCR process might convert an image of "the" into the character string "die".

But if the PDF file was generated from text in any textual format, such as HTML, LaTeX, TXT, ODT, or DOC, the PDF file preserves the original text strings internally. If you copy and paste text from a PDF of that kind into another editor, such as OpenOffice or MS Word, you will get a copy of the original character string, but some or all of the formatting information may be lost. That process would never convert "the" into "die".

There are some caveats, however. Some PDF files use special characters for ligatures, such as fi, fl, ff, etc. Even though the ligatures are represented in the character strings, a copy and paste from such a file into another editor may convert a ligature to an unrecognized character. (Some OCR systems also have difficulty with ligatures because the letters "f" and "i" or "l" are too close together for easy recognition.)

John Sowa
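As a rough illustration of the ligature caveat, here is a minimal Python sketch of how text copied out of a PDF might be cleaned up afterwards. It assumes the ligatures arrive as the Unicode presentation-form code points U+FB00 to U+FB04; a given PDF may encode them differently, in which case this does nothing. The names `LIGATURES` and `expand_ligatures` are hypothetical, chosen only for this example.

```python
# Assumption: the pasted text carries ligatures as Unicode code points
# U+FB00..U+FB04. Other encodings (e.g. private-use glyphs) are not handled.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Translation table mapping each ligature character to its plain letters.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace the five Latin f-ligature characters with ordinary letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    pasted = "The \ufb01le is in the of\ufb01ce."
    print(expand_ligatures(pasted))
```

Unicode compatibility normalization (`unicodedata.normalize("NFKC", text)`) also expands these ligatures, but it rewrites many other characters as well, so the explicit table is the more conservative choice for corpus work.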