Re: pdfs/ OCR question: msg#00089
science.linguistics.corpora

There are several issues with extracting text from PDF files:

- Scanned and OCRed documents, as has been mentioned, often have the scanned images of the original plus a text 'layer' to be used for copying-and-pasting. Not all documents have this text layer, however.

- In some senses, it can be said that PDF 'preserves the original text strings'. However, PDF wasn't designed for recovery of the original text; it was designed for faithful rendering on screen or on a printer. Frequently, spaces are missing from text in the PDF file -- for rendering, this doesn't matter, since the characters simply need to be drawn in the correct place. For text extraction, however, the presence of spaces often has to be inferred from the position of surrounding characters. Line breaks are never present, and again must be inferred from text placement. The sequence of text in the PDF document may not be the same sequence as in the original file, since sequence is irrelevant to rendering. And so on...

- Some PDF files use font subsets with custom encodings -- they have a table at the beginning of the file with codes and the glyphs to render for each code; however, these codes aren't in ASCII or UTF-8 or anything recognisable. When you extract text from such a file, you generally get junk.

There are a few tools around for extracting text from PDF files -- PDFBox and Multivalent are two open source tools that I've used that perform pretty well.

Good luck!

Brett Powley

On 12/12/2006, at 2:31 PM, John F. Sowa wrote:

That depends on how the PDF was created:

--------------------------------------------------------------
Brett Powley -- PhD Candidate
Centre for Language Technology, Macquarie University, Australia
w: http://www.ics.mq.edu.au/~bpowley
faciendi plures libros nullus est finis
frequensque meditatio carnis adflictio est
--------------------------------------------------------------
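
As a rough illustration of the point above about inferring spaces and line breaks from character placement, here is a minimal sketch in Python. It is not how PDFBox or Multivalent work internally. The `Glyph` record, the `rebuild_text` function, and the `space_ratio` and `line_tol` thresholds are all hypothetical names and values chosen for the example, and it assumes the glyph positions have already been pulled out of the PDF by some other tool.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned character."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline (PDF coordinates: y grows upward)
    width: float  # advance width of the glyph


def rebuild_text(glyphs, space_ratio=0.3, line_tol=2.0):
    """Group glyphs into lines by baseline, then add spaces at wide gaps.

    space_ratio and line_tol are illustrative thresholds, not standard values.
    """
    # Glyphs may arrive in any order, so sort from the top of the page down.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        # A glyph joins the current line if its baseline is close enough.
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left-to-right within the line
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A gap wider than a fraction of the previous glyph's width
            # is treated as an inferred space.
            if gap > space_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)


# Glyphs deliberately listed out of reading order.
sample = [
    Glyph("r", 5, 88, 5), Glyph("b", 13, 100, 5), Glyph("t", 0, 100, 5),
    Glyph("o", 0, 88, 5), Glyph("e", 18, 100, 5), Glyph("o", 5, 100, 5),
]
print(rebuild_text(sample))
```

Real documents need more care than this (multiple columns, rotated text, varying font sizes, kerning), which is part of why extraction results vary between tools.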