Re: pdfs/ OCR question: msg#00087
science.linguistics.corpora

Recent versions of Acrobat have built-in OCR functionality. Apparently each page is analyzed separately and processed in several steps, with character recognition occurring before words are built. The errors you get depend on which language was set as the OCR language (you can select several, and judging from the error you found, the page may have been processed as German).

You can rerun OCR on the page at any time, provided you have Acrobat Standard or higher [1]. If you export the page images to TIFF format (lossless), you can run them through any OCR program, including the one provided as part of Microsoft Office (Microsoft Document Imaging).

I am currently unaware of any software that will clean up such errors, but the Office 2007 imaging software may have some of that functionality built in, since Word 2007 has probabilistic error detection. That is only a suspicion and would have to be verified. Maybe the Microsoft people on this list would be able to help.

Best,
Klaus

[1] http://www.adobe.com/products/acrobat/matrix.html

---
Klaus Guenther
Graduate Assistant
Chair of English Linguistics
University of Bamberg, Germany

Hunter, Duncan wrote:
> Thanks for this Alexandre.
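As a rough sketch of the TIFF route described above: once the page images are exported, re-running OCR with an explicit language setting could be scripted along these lines. The message names no particular engine beyond Microsoft Document Imaging, so the use of the open-source Tesseract command-line tool here is purely an assumption, as are the helper names `build_ocr_command` and `ocr_pages` and the folder name in the usage note.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(tiff_path, languages=("eng",)):
    """Return the Tesseract command line for one exported page image.

    Assumption: Tesseract is the OCR engine; the original message does
    not name one. Several languages are joined with '+', e.g. eng+deu.
    """
    tiff_path = Path(tiff_path)
    output_base = tiff_path.with_suffix("")  # Tesseract appends .txt itself
    return ["tesseract", str(tiff_path), str(output_base),
            "-l", "+".join(languages)]


def ocr_pages(folder, languages=("eng",)):
    """Build (and, if Tesseract is installed, run) OCR for each TIFF page."""
    pages = sorted(p for p in Path(folder).iterdir()
                   if p.suffix.lower() in {".tif", ".tiff"})
    commands = [build_ocr_command(p, languages) for p in pages]
    if shutil.which("tesseract") is None:
        return commands  # engine not found: only report what would be run
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return commands
```

Calling `ocr_pages("exported_pages", languages=("eng",))` on a hypothetical folder of exported pages would redo the recognition as English only, which is one way to check whether the errors in question came from a German language setting.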