Subject: Re: What's cooking in git.git (Jul 2009, #01; Mon, 06) - msg#00302

List: git


On Mon, 6 Jul 2009, Junio C Hamano wrote:
>
> * lt/read-directory (Fri May 15 12:01:29 2009 -0700) 3 commits
> - Add initial support for pathname conversion to UTF-8
> - read_directory(): infrastructure for pathname character set
>   conversion
> - Add 'fill_directory()' helper function for directory traversal
>
> Before adding the real "conversion", this needs a few real fixups, I
> think. For example there is one hardcoded array that is used without
> bounds check.

Hmm. I'm not sure what array you're talking about (the newpath/newbase
ones? We do protect against PATH_MAX, it's just that we protect against it
in the "previous iteration").

The bigger issue, though, is that I spent half a day looking more at this
series last Thursday, and I've got some improvements, but getting "all the
way" turns out to be really quite painful.

Why?

We have a _lot_ of code that does "lstat()" on pathnames, and it all
basically uses the internal git representation of the pathname. In
particular, we do this a lot for index lookups, but it's true in other
cases too (example: things like tree merging, where we check whether a
file exists in the working tree).

To test this all out, I actually fleshed out the patches to the point
where I could do

    [core]
        PathEncoding = Latin1

and actually have the working tree use Latin1 encoding, and convert
internally in git to UTF-8, and have a working "git add ."

However, "git add ." was just about the only thing that I made do the
right thing. Even doing a simple "git diff" afterwards would then show the
file as deleted, because the UTF-8 version of the file (that the index
contained) didn't exist in the filesystem. I fixed that with a hack, but
it basically turns out to be pretty damn ugly, and there's a _lot_ of
those places.
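
A minimal sketch of why the lookup fails, not git code. It assumes the working tree stores names as Latin1 bytes while the index stores UTF-8 bytes. The `working_tree` set and the `exists_on_disk` and `to_filesystem` helpers are hypothetical stand-ins for the filesystem, for lstat(), and for the reverse conversion that every such call site would need.

```python
# Hypothetical model: the working tree is just a set of on-disk byte names.
name = "\u00e5\u00e4\u00f6"            # "åäö"
on_disk = name.encode("latin-1")       # 3 bytes, as a Latin1 filesystem stores it
in_index = name.encode("utf-8")        # 6 bytes, git's internal representation

working_tree = {on_disk}


def exists_on_disk(path: bytes) -> bool:
    """Stand-in for lstat(): succeeds only on an exact byte-for-byte match."""
    return path in working_tree


def to_filesystem(path: bytes) -> bytes:
    """Convert an internal UTF-8 name back to the Latin1 on-disk name."""
    return path.decode("utf-8").encode("latin-1")


# Looking up the index name directly misses the file, so it looks "deleted".
naive_lookup = exists_on_disk(in_index)

# Converting back first finds it; this is the step each lstat() caller lacks.
converted_lookup = exists_on_disk(to_filesystem(in_index))

print(naive_lookup, converted_lookup)
```

The point of the sketch is only that the conversion has to run in both directions: once when reading directory entries, and again before every path-based system call.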

So, the question is, "What now?"

There are a few alternatives:

(a) don't do any of this crap at all. What git does right now works fairly
well for most people. Instead, perhaps worry about just the crazy
case-insensitive filesystems, which are a totally separate issue.

End result: git will always have problems with the crazy NFD format
that OS X uses. Mixing git archives across OS X and other saner
operating systems (and in this context, Windows really does count as
"saner" - it really is OS X that is braindamaged!) will be painful if
you have odd characters in your working tree.

This is the simplest approach, of course. The case-insensitivity is
still not trivial, but we could work on it, and it really is a
different problem (and has none of the "if you look the file up with a
converted name, you cannot see it" issues that the Latin1<->UTF8
example had).

(b) Forget about the general case (like Latin1) that needs two-way
conversion. Just worry about OS X being crazy, and do the NFD->NFC
translation, which only needs to be done one way (because OS X will
still accept and recognize NFC characters, so the "converted" path is
still seen as valid by 'lstat()' and friends).

This is very much just a special case of handling filesystems that are
UTF-8, but are confused about what "equivalent" and "identical" means,
and where the filesystem designer was a moron on some seriously crazy
drugs, and thought that equivalence means identity, and thought that
NFD is a sane form to expose.

This is a much simpler case than the general approach. I don't have OS
X to test with, though, and so far it hasn't appeared that any OS X
people really care enough to actually implement it. So I can fix up my
series to a certain point, but will never be able to really do the
final testing and tuning. At least with the full "treat filesystem as
Latin1 encoding", I could _test_ it.

(c) Try to bite the bullet. I can do this, but it really is going to be a
_very_ invasive patch-series, and it will probably involve some nasty
changes to the index format (for performance, we'll likely have to
change the index to have _both_ the "git filename", and the
"filesystem filename" in it).

This was what I wanted to do, and it's what you'd need to do if you do
things like Latin1 filesystem trees or ones where pathnames are done
with Shift-JIS encoding or if we want to actually use the (crazy)
native Windows UCS filesystem accessors or whatever.

But I have to admit that after looking at the pain, I'm not at all
convinced it's worth it. Do we ever want to say "git supports
filesystems with Shift-JIS encoding"? Do people really care deeply
enough about non-UTF-8 filesystems that they'd be willing to live with a
_lot_ of pretty nasty complexity, and some real performance overhead?
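
A rough sketch of the one-way conversion in option (b), assuming directory entries arrive as UTF-8 bytes. The `readdir_nfc` helper is hypothetical and only illustrates the idea. It uses Python's standard Unicode NFC normalization, which may not match the exact decomposition rules OS X applies in every detail.

```python
import unicodedata


def readdir_nfc(raw_names):
    """Return the given UTF-8 entry names normalized to NFC.

    raw_names is an iterable of bytes, standing in for readdir() results.
    """
    return [
        unicodedata.normalize("NFC", raw.decode("utf-8")).encode("utf-8")
        for raw in raw_names
    ]


# "å" in decomposed form (NFD): 'a' followed by COMBINING RING ABOVE.
nfd_name = "a\u030a".encode("utf-8")
# "å" in composed form (NFC): a single code point.
nfc_name = "\u00e5".encode("utf-8")

converted = readdir_nfc([nfd_name])
print(converted == [nfc_name])
```

No reverse mapping is sketched here because, as described in (b), the filesystem still accepts the NFC name when it is passed back to lstat() and friends.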

I have to say, even with plain UTF-8, git isn't really a pleasure to use.
While I did my Latin1 test, I used filenames like "åäö" (the three extra
Finnish/Swedish characters), and if you do this

    mkdir test-repo
    cd test-repo
    git init
    echo testfile > åäö
    git add .
    git ls-files

the end result is not actually really usable. We quote it to a binary
mess, rather than showing "åäö". Our pathname quoting is trying to be
safe, which is good, but it does mean that right now, odd characters
aren't very friendly even _if_ you are using a sane filesystem, and all
plain NFC UTF-8.
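
To make the "binary mess" concrete, here is a simplified sketch of C-style pathname quoting. The `c_quote` function is hypothetical and is not git's implementation: it always wraps the name in double quotes and writes every byte outside printable ASCII as a three-digit octal escape, which is roughly what the quoted output looks like for a name such as "åäö".

```python
def c_quote(path: bytes) -> str:
    """Quote a pathname: printable ASCII is kept, other bytes become \\ooo."""
    out = []
    for b in path:
        ch = chr(b)
        if ch in '"\\':
            out.append("\\" + ch)          # escape quote and backslash
        elif 0x20 <= b < 0x7F:
            out.append(ch)                 # printable ASCII passes through
        else:
            out.append("\\%03o" % b)       # everything else as octal
    return '"' + "".join(out) + '"'


quoted = c_quote("\u00e5\u00e4\u00f6".encode("utf-8"))
print(quoted)   # prints "\303\245\303\244\303\266"
```

Three characters on disk turn into six escaped bytes on screen, which is safe for scripts but unreadable for a person.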

So right now, my personal opinion is:

- let's just face the fact that the only sane filename representation is
NFC UTF-8. Show filenames as UTF-8 when possible, rather than quoting
them.

- Do case (b) above: add support for converting NFD -> NFC at readdir()
time, so that OS X people can use UTF-8 sanely.

- add a "binary encoding" mode for filesystems that actually use Latin1,
just so that if people use Latin1 or Shift-JIS filesystem encodings, we
promise that we'll never munge those kinds of names.

- Maybe we'd make the "binary encoding" (which is effectively existing
git behavior) be the default on non-OSX platforms.

but that's just my gut feel from trying to weigh the costs of trying to do
something more involved against the costs of OS X support and just letting
crazy encodings exist in their own little worlds. So a development group
that uses Shift-JIS (or Latin1) would be able to work internally with git
that way, but would not be able to sanely work with the world at large
that uses UTF-8.

Linus
