
Puzzling PDF

On 02/16/2014 05:29 PM, Emile van Sebille wrote:
> On 2/16/2014 6:00 AM, F.R. wrote:
>> Hi all,
>> Struggling to parse bank statements unavailable in sensible
>> data-transfer formats, I use pdftotext, which solves part of the
>> problem. The other day I encountered a strange thing: one single
>> figure out of many was erroneously converted into letters. Adobe Reader
>> displays the figure 50'000 correctly, but pdftotext makes it into
>> "SO'OOO" (The letters "S" as in Susan and "O" as in Otto). One would
>> expect such a mistake from an OCR. However, the statement is not a scan,
>> but is made up of text. Because malfunctions like this put a damper on
>> the hope to ever have a reliable reader that doesn't require
>> time-consuming manual verification, I played around a bit and ended up
>> even more confused: When I lift the figure off the Adobe display (mark,
>> copy) and paste it into a Python IDLE window, it is again letters (ASCII
>> 83 and 79), while on the Adobe display it shows correctly as digits. How
>> can that be?
> I've also gotten inconsistent results using various PDF-to-text
> converters[1], but getting an explanation for pdftotext's failings
> here isn't likely to happen.  I'd first try Google Docs' online
> conversion tool to see if you get better results.  If you're lucky
> it'll do the job and you'll have confirmation that better tools
> exist.  Otherwise, I'd look for an alternate way of getting the bank
> info than working from the PDF statement.  At one site I've scripted
> Firefox to access the bank's web-based inquiry to retrieve the new
> activity overnight and use that to complete a daily bank reconciliation.
> HTH,
> Emile
> [1] I wrote my own once to get data out of a particularly gnarly EDI
> specification PDF.

Emile, thanks for your response. Thanks to Roy Smith and Alister, too.

pdftotext has been working just fine, which makes this freak
incident all the more puzzling. It smacks of an OCR error, but I wonder
where OCR would come in. I certainly suspected that the font I was
looking at had fives and zeroes identical to esses and ohs,
respectively, but the suspicion didn't hold up to scrutiny. I attach a
little screenshot: At the top, the way it looks on the statement. Next,
two words marked with the mouse. (It is one single selection; the
highlight just doesn't color the space.) Ctrl-C puts both words on the
clipboard. Ctrl-V drops them into the Python IDLE window between the
quotation marks. Lo and behold: they're clearly different! A little bit
of code around them displays the ASCII codes. Isn't that interesting?
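
For anyone who wants to repeat the check, here is a minimal sketch of
that kind of code, assuming Python 3. The names code_points and
suspicious_amount are purely illustrative, and the second function is
only a rough heuristic of my own for flagging tokens in pdftotext output
that become a valid amount once "S" is read as 5 and "O" as 0. It is not
anything pdftotext or Adobe Reader provides.

```python
import re

def code_points(text):
    """Return (character, code point) pairs for every character in text."""
    return [(ch, ord(ch)) for ch in text]

# Assumption: only these two look-alike letters are considered.
LOOKALIKES = str.maketrans({"S": "5", "O": "0"})

# Amounts written with an apostrophe as thousands separator, e.g. 50'000.
AMOUNT = re.compile(r"\d{1,3}(?:'\d{3})*(?:\.\d+)?")

def suspicious_amount(token):
    """Return the digit reading of token if it is a valid amount only
    after mapping S to 5 and O to 0; otherwise return None."""
    if AMOUNT.fullmatch(token):
        return None  # already a proper number, nothing to flag
    candidate = token.translate(LOOKALIKES)
    return candidate if AMOUNT.fullmatch(candidate) else None

pasted = "SO'OOO"  # what the clipboard delivered instead of 50'000
print(code_points(pasted))
print(suspicious_amount(pasted))
```

Running every token of a converted statement through suspicious_amount
would at least point out where a manual look is needed. Short words made
only of these letters would be flagged as well, so it is a hint and not a
verdict.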


No matter. You're both right. There are alternatives. The best would be 
to get the data in a CSV format. Alas, I am so lightweight a client that 
banks don't even bother to find out what I am talking about.

I know how to access web pages programmatically, but haven't gotten
around to dealing with password-protected log-ins or to submitting the
kind of data one types into forms interactively.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: pdf-weirdness.gif
Type: image/gif
Size: 10726 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20140216/4960d1ae/attachment.gif>