Re: pdfs/ OCR question: msg#00084 science.linguistics.corpora
I would guess that the OCR was done by the software that generated the PDF. You might be able to check which software that was by looking at the PDF document's properties.

The text is stored on a separate layer from the image, and the reader just does region matching for selection purposes.

If you need to have this fixed, you will probably need to burst the PDF into its page images and have those re-OCRed. Software you might find useful includes PDFBox (http://www.pdfbox.org/) and Gamera (http://ldp.library.jhu.edu/projects/gamera/).

You can also look at Distributed Proofreaders to see if there is anything to be learned from their experience: http://www.pgdp.net/

Regards,
Alex.

On 12/11/06, Hunter, Duncan <D.I.Hunter@xxxxxxxxxxxxx> wrote:
Quick question about pdfs/ OCR:
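
For anyone wanting to try the burst-and-re-OCR route suggested in the reply, here is a rough sketch in Python. It does not use PDFBox or Gamera; it assumes instead that the command-line tools `pdftoppm` (from Poppler) and `tesseract` are installed, which is a substitution on my part and not something the reply recommends. The function names `burst_command`, `ocr_command`, and `reocr` are hypothetical, and the 300 dpi default is only a guess at a reasonable scan resolution.

```python
import shutil
import subprocess
from pathlib import Path


def burst_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm call that renders every page of the PDF to a PNG.

    pdftoppm names its output <out_prefix>-<page number>.png.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, lang="eng"):
    """Build the tesseract call for one page image.

    tesseract writes its text to <output base>.txt; here the output base
    is the image path with its extension removed.
    """
    image_path = Path(image_path)
    out_base = image_path.with_suffix("")
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def reocr(pdf_path, work_dir, dpi=300, lang="eng"):
    """Burst a PDF into page images, OCR each one, return the page texts."""
    # Assumption: both tools are on PATH; fail early with a clear message if not.
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: burst the PDF into one image per page.
    subprocess.run(burst_command(pdf_path, work_dir / "page", dpi), check=True)

    # Step 2: re-OCR each page image in page order.
    texts = []
    for image in sorted(work_dir.glob("page-*.png")):
        subprocess.run(ocr_command(image, lang), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return texts
```

Calling `reocr("scan.pdf", "pages")` (both names are placeholders) would leave the page images and their `.txt` files in `pages/`. Note that this renders whole pages, text layer included if the viewer draws it, so it is worth checking the images before trusting the new OCR output.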