
The file method read does not "see" CR

On Wed, Dec 11, 2019 at 2:18 AM Stephen Tucker <stephen_tucker at sil.org> wrote:
> Chris,
> Many thanks. The first part of your answer was spot on. I have modified the program, opening its input file in "rb" mode, and now the program is working as I intend it to.


> I am puzzled by the second part, where you say that the read call will return one character, if the file is in text mode.
> I was expecting this statement to read
>    The read call will return one byte if the file is in binary mode.

Both statements are true.

> Given that I was opening the file in "r" mode (which, I am assuming, is "text" mode), then, if your statement is true, I have two more questions:
>    1. If CR is a character and I open the file in text mode, then why does the program not "see" the CR characters?

It's a little messy, but because CR + LF means "line ending" on
Windows, the two-byte unit is generally treated as a single logical
"end of line" character. You can disable this by passing the newline
keyword argument (newline="") to the open() call, but in general, text
files care about lines, rather than being bothered by exactly what
sort of line ending they're using.
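
A minimal sketch of that override, assuming the keyword argument meant
here is open()'s newline parameter. The temporary file and the variable
names are only for illustration. Passing newline="" turns off the
translation on reading:

```python
import os
import tempfile

# Write raw bytes so the file really contains CR LF pairs on any platform.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"one\r\ntwo\r\n")
    path = tmp.name

# Default text mode: each CR LF pair comes back as a single "\n".
with open(path, "r", encoding="ascii") as f:
    translated = f.read()

# newline="" turns the translation off, so the CR characters survive.
with open(path, "r", encoding="ascii", newline="") as f:
    untranslated = f.read()

os.remove(path)

print(repr(translated))    # 'one\ntwo\n'
print(repr(untranslated))  # 'one\r\ntwo\r\n'
```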

>    2. Does the program not see CR characters under these circumstances because they do not "count" as characters or for some other reason?
> (You have probably spotted that this question 1 is virtually the same as my original question.)

The Python file handler is interpreting a two-byte sequence as a
single logical unit, and then representing that unit with the single
character "\n". Basically, you're at the tail end of decades of mess
involving line endings, and we're all doing our best to cope with the
morass of craziness that's out in the world. For the most part, what
Python does is what you want; in the situations where it isn't, you
can override it with the newline parameter to open().
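
To make the "single logical unit" concrete, here is a rough sketch that
calls read(1) until the file is exhausted, once in text mode and once
in binary mode. The helper read_units and the sample file contents are
made up for this example, not taken from the original program:

```python
import os
import tempfile

def read_units(path, mode):
    """Return everything read(1) yields, one item per call."""
    # Binary mode takes no encoding; text mode gets an explicit one.
    kwargs = {} if "b" in mode else {"encoding": "ascii"}
    units = []
    with open(path, mode, **kwargs) as f:
        while True:
            unit = f.read(1)
            if not unit:  # empty str or bytes means end of file
                break
            units.append(unit)
    return units

# Raw bytes: "a", CR, LF, "b".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\r\nb")
    path = tmp.name

text_units = read_units(path, "r")
byte_units = read_units(path, "rb")
os.remove(path)

print(text_units)  # ['a', '\n', 'b']
print(byte_units)  # [b'a', b'\r', b'\n', b'b']
```

In text mode the CR LF pair arrives as one "\n" from a single read(1)
call; in binary mode the CR and the LF each get their own call.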

Be aware that bytes and characters are FAR more different than this
suggests. But if you're confident that your file is encoded as ASCII
or UTF-8 (which are probably the most common encodings you'll
encounter), then you can at least be certain that the byte value b"\r"
corresponds exactly to the character "\r".
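
A quick way to check that correspondence, and to see where it stops
holding. The choice of UTF-16 and of the character U+00E9 as
counterexamples is mine, purely for illustration:

```python
# In ASCII and UTF-8, the CR character is exactly the single byte 0x0D.
cr_from_ascii = b"\r".decode("ascii")
cr_as_utf8 = "\r".encode("utf-8")
print(cr_from_ascii == "\r")  # True
print(cr_as_utf8 == b"\r")    # True

# In UTF-16 (little-endian here) the same character takes two bytes.
cr_as_utf16 = "\r".encode("utf-16-le")
print(cr_as_utf16)            # b'\r\x00'

# Outside ASCII, one character can be several bytes even in UTF-8.
e_acute = "\u00e9"
e_acute_bytes = e_acute.encode("utf-8")
print(len(e_acute), len(e_acute_bytes))  # 1 2
```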