pdfs/ OCR question: msg#00082 science.linguistics.corpora
Quick question about PDFs/OCR:
Some text is copied from a PDF file and pasted into a text or Word file. It contains errors: for example, 'the' has become 'die' (you notice that in the original PDF the 't' and 'h' are quite close together). At what stage has this misrecognition/miscopying occurred?
Where does the OCR take place? The OCR functionality is, presumably, part of the PDF reader software itself?
Can anything be done to deal with the problem?
Duncan Hunter