logo       

Re: [Python-Dev] standard library mimetypes module pathologically broken?: msg#00664

python-dev

Subject: Re: [Python-Dev] standard library mimetypes module pathologically broken?



On Fri, Jul 31, 2009 at 14:16, Jacob Rus <jacobolus@xxxxxxxxx> wrote:
Hi all,

In an attempt to figure out some twisted.web code, I was reading
through the Python Standard Libraryâs mimetypes module today, and
was shocked at the poor quality of the code. I wonder how the
mimetypes code made it into the standard library, and whether anyone
has ever bothered to read it or update it: it is an embarrassment.
Much of the code is redundant, portions fail to execute, control
flow is routed through a horribly confusing mess of spaghetti, and
most of the complexity has no clear benefit as far as I can tell. I
probably should drop the subject and get back to work, but as a good
citizen, itâs hard to just ignore this sort of thing.

I have not looked at the code nor ever used it (that I can remember) so I can't directly address the quality. But I can say the code was added in 1997 which puts it as an addition in Python 1.4. That's why before Python took off mainstream and began to tighten up the quality control on the standard library.

I also would like to stay that I am not embarrassed by anything in Python. It's unfortunate if the mimetypes module's code is a mess, but I think putting at embarrassing is taking a little far and borderline insulting (which I don't think you meant to do).
Â

mimetypes.py stores its types in a pair of dictionaries, one for
"strict" use, and the other for "non-standard types". It creates the
strict dictionary by default out of apache's mime.types file, and
then overrides the entries it finds with a set of exceptions. Then
it creates the non-standard dictionary, which is set to match if the
strict parameter is set to False when guessing types. Just in this
basic design, and in the list of types in the file, there are
several problems:

Â* Various apache mime types files are read, if found, but the
 Âordering of the files is such that older versions of apache are
 Âsometimes read after newer ones, overriding updated mime types
 Âwith out-of-date versions if multiple versions of apache are
 Âinstalled on the system.

Â* The vast majority of types declared in mimetypes.py are
 Âduplicates of types already declared by Apache. In a few cases
 Âthis is to change the apache default (make an exception, that
 Âis), but in most cases the mime type and extension are
 Âcompletely identical. This huge number of redundant types makes
 Âthe file substantially harder to follow. No comments are
 Âprovided to explain why various sets of exceptions are made to
 ÂApache's default mime types, and in several cases mimetypes.py
 Âseems to just be out of date as compared to recent versions of
 ÂApache, for instance not knowing about the 'text/troff' type
 Âwhich was registered in January 2006 in RFC 4263.

Â* The 'non-standard' type dictionary is nearly useless, because
 Âall of the types it declares are already in apache's mime.types
 Âfile, meaning that types are, as far as I can tell trying to
 Âfollow ugly program flow, *never* drawn from the non-strict
 Âdictionary, except in the improbable situation where the
 Âmimetypes module is initialized with a custom set of
 Âapache-mime.typesâlike files, which does not include those
 Â'non-standard' types. I personally cannot see a use case for
 Âinitializing the module with a custom set of mime types, but
 Âthen leaving the very few types included as non-strict to the
 Âdefaults: this seems like a fragile and pathological use case.
 ÂGiven this, I donât see any benefit to dragging the 'strict'
 Âparameter along all the way through the code, and would advise
 Âgetting rid of it altogether. Does anyone know of any code that
 Âuses the mimetypes module with strict set to False, where the
 Ânon-strict code path ever *actually* is executed?

But though these problems, which affect actual use of the code and
are therefore probably most important, are significant, they really
pale in comparison to the awful quality of implementation. I'll try
to briefly outline my understanding of how code flows in
mimetypes.py, and what the problems are. I haven't stepped through
the code in a debugger, this is just from reading it, so I apologize
in advance if I get something wrong. This is, however, some of the
worst code Iâve seen in the standard library or anywhere else.

Â* It defines __all__: I didnât even realize __all__ could be used
 Âfor single-file modules (w/o submodules), but it definitely
 Âshouldnât be here.

__all__ is used to control what a module exports when used in an import *, nothing more. Thus it's use in a module compared to a package is completely legitimate.
Â
This specific __all__ oddly does not include
 Âall of the documented variables and functions in the mimetypes
 Âclass. Itâs not clear why someone calling import * here wouldnât
 Âwant the bits not included.

If something is documented by not listed in __all__ that is a bug.
Â


Â* It creates a _default_mime_types() function which declares a
 Âbunch of global variables, and then immediately calls
 Â_default_mime_types() below the definition. There is literally
 Âno difference in result between this and just putting those
 Âvariables at the top level of the file, so I have no idea why
 Âthis function exists, except to make the code more confusing.

It could potentially be used for testing, but that's a guess.
Â

Â* It allows command line usage: I donât think this is necessary
 Âfor a part of the standard library like this. There are better
 Âtools for finding mime types from the command line which ship
 Âwith most operating systems.

Yeah, various modules have command-line versions which are not truly necessary. This can probably stand to go.
Â


Â* Its API is pretty poorly designed. It offers 6 functions when
 Âabout 3 are needed, and it takes a couple reads-through of the
 Âcode to figure out exactly what any of them are supposed to do.

Â* The operation is crazy: It defines a MimeTypes class which
 Âactually stores the type mappings, but this class is designed to
 Âbe a singleton. The way that such a design is enforced is
 Âthrough the use of the module-global 'init' function, which
 Âmakes an instance of the class, and then maps all of the
 Âfunctions in the module global namespace to instance methods.
 ÂBut confusingly, all such functions are also defined
 Âindependently of the init function, with definitions such as:

   Âdef guess_type(url, strict=True):
     Âif not inited:
       Âinit()
     Âreturn guess_type(url, strict)

 ÂIâd be amazed if anyone could guess what that code was trying to
 Âdo. I did a double-take when I saw it.

Probably came from someone who is very OO happy. Not everyone comes to Python ready to embrace its procedural or slightly functional facets.
Â

 ÂOf course, that return call is only ever reached the first time
 Âthis function is called, if init() has not happened yet. This
 Âwas all presumably done for lazy initialization, so that the
 Âtype information would only be loaded when needed. Needless to
 Âsay, there are more pythonic ways to accomplish such a goal.

 ÂOh, also, the other good one here is that it means that someone
 Âwho writes `from mimetypes import guess_types` gets something
 Âdifferent than someone who writes:
 Â`import mimetypes; guess_types = mimetypes.guess_types`. In the
 Âformer case, this wrapper function is saved as guess_type, which
 Âeach time just calls the (changed after init())
 Âmimetypes.guess_types function. This caused a performance
 Ânightmare before March of this year, when there was no check for
 Â`if not inited` before running init() (amazing!?).

Â* Because the type datastore is set up to be a singleton, any time
 Âinit() is called in one section of code, it resets any types
 Âwhich have been added manually: this means that if init() is
 Âcalled by different pieces of code in the same python program,
 Âthey will interfere with each-othersâ type databases, and break
 Âeach-other. This is extremely fragile and, in my opinion, crazy.
 ÂIt is hard for me to imagine any use case that would benefit
 Âfrom this ability to clobber custom type mappings, and I very
 Âmuch doubt that any code calling the mimetypes module realizes
 Âthat the contract of the API is so flimsy by definition. In
 Âpractice, I would not advise consumers of this API to ever call
 Âinit() manually, or to ever add custom mime type mappings,
 Âbecause they are setting themselves up for hard-to-track bugs
 Âdown the line.

Â* The 'inited' flag is a documented part of the interface, in the
 Âstandard library documentation. I cannot imagine any reason to
 Âset this flag manually: setting it to false when it was true
 Âwill have no effect, because the top-level functions have
 Âalready been replaced by instance methods of the 'db' MimeTypes
 Âinstance. Setting it to true when it was false will make the
 Âcode just break outright.

Â* In python 3, this has been changed a bit. Thereâs still an
 Âinited flag, and it still in the docs, but now awful code from
 Âabove has been changed slightly, to:

   Âdef guess_type(url, strict=True):
     Âif _db is None:
       Âinit()
       Âreturn _db.guess_type(url, strict)

 ÂWhich is still embarrassingly confusing. On the upside, the
 Âinited flag now does literally nothing, but remains defined, and
 Âin the docs.

Â* The 'types_map' and 'common_types' (for 'strict' and
 Â'common' types, respectively) dictionaries are also a documented
 Âpart of the interface. When init() is called, a new MimeTypes
 Âinstance makes a (different) types_map which is a tuple of two
 Âdictionaries, for 'strict' and 'common' types. Then this
 Âinstance reads the apache mime.types files and adds the types to
 Âits pair of self.types_map dictionaries, and then after that
 Âlooks at the global types_map and common_types dictionaries and
 Âadds *those* types to its self.types_map. Then at the end it
 Âreplaces the global types_map with self.types_map[True] and
 Âreplaces common_types with self.types_map[False]. Unfortunately,
 Âwhile changing these dictionaries will have an effect on the
 Âoperation of the library, it will not update the types_map_inv
 Âmapping, so inverse lookups will not behave as the changer
 Âexpects. If these dictionaries are going to remain documented,
 Âthe documentation should be clear to describe them as read only
 Âto avoid very confusing bugs.

Â* Speaking of these dictionaries, .copy() is called on those two
 Âand a few other inside MimeTypes.__init__(), which happens every
 Âtime the global init() function is called, but then init() puts
 Âthe copies back in the global namespace, meaning that the
 Âoriginal is discarded. Basically the only reason for the .copy()
 Âis to make sure that the correct updates are applied to the
 Âapache mimetype defaults, but the code will gladly re-read all
 Âof the apache files even after its mapped types are already in
 Âthese dictionaries, essentially making re-initializing a (very
 Âexpensive) no-op. All weâre doing is a lot of unnecessary extra
 Âdisk reads and memory allocations and deallocations. The only
 Âtime this has any effect is when a non-singleton MimeTypes
 Âinstance is created, as in the read_mime_types function.

Â* And that read_mime_types function is a doozy. It tries to open a
 Âfilename, spits back None if thereâs an IOError (instead of
 Âraising the exception as it should), and then creates a new
 ÂMimeTypes instance (remember, this is identical to the singleton
 ÂMimeTypes instance because it starts itself from that oneâs
 Âmappings), adds any new types it finds in the file with that
 Âname, and then returns the 'strict' types_map from it. Iâm not
 Âsure whether any sane user of this API would expect it to return
 Âthe existing type mappings *plus* the extra ones in the provided
 Âfilename, but I really canât imagine this function ever being
 Âparticularly useful: it requires you are reading mime types in
 Âapache format, but not the apache mime type files you already
 Âlooked at, and then the only way to find out what new mappings
 Âwere defined is to take the difference of the default mappings
 Âwith the result of the function.

Â* The code itself, on a line-by-line basis, is unpythonic and
 Âunnecessarily verbose, confusing, and slow. The code should be
 Ârewritten to use python 2.3â2.6 features: even leaving its
 Âfunctionality identical it could be cut to about half the number
 Âof lines, and made clearer.

In case the above doesnât make this clear: this code is extremely
confusing.

Yeah, kind of picked up on that. =)
Â
Trying to read it has caused all the people around me to
look up as I shout "what the fuck??!" at the screen every few
minutes, as each new revelation gives another surprise. Iâm not
convinced that I completely understand what the code does, because
it has been quite effectively obfuscated, but I understand enough to
want to throw the whole thing out, and start essentially from
scratch.

So the question is, what should be done about this? Iâd like to hear
how people use the mimetypes module, and how they expect it to work,
to figure out the sanest possible mostly-backwards-compatible
replacement which could be dropped in (ideally this would just allow
the use of default mimetypes and rip out the ability to alter the
default datastore: or is there some easy way to change this away
from a singleton without breaking code which calls these methods?),
and then extend that replacement to support a somewhat saner model
for anyone who actually wants to extend the set of mappings. My
guess is that replacement code could actually fix subtle bugs in
existing uses of this module, by people who had a sane expectation
of how it was supposed to work.

At the very least, the parts about figuring out exactly which
exceptions to Apacheâs set of default types are useful would be a
good idea, and Iâd maybe even recommend including an up-to-date copy
of Apacheâs mime.types file in the Python Standard Library, and then
only overriding its definitions for future versions of Apache (and
then overriding the combination of both of those with further
exceptions deemed useful for python, with comments explaining why
each exception), so that weâre not bothering to look up horribly
out-of-date types in multiple locations from Apache 1, 1.2, 1.3,
etc. Iâd also recommend making the API for overriding definitions be
the same as the code used to declare the default overrides, because
as it is there are three ways do define types: a) in a mime.types
formatted file, b) in a python dictionary that gets initialized with
a confusing bit of code, and c) through the add_type function.

Does anyone else have thoughts about this, or maybe some good (it
had better be *really* good) explanations why this code is the way
it is? I'd be happy to try to rewrite it, but I think Iâd need a bit
of help figuring out how to make the rewrite backwards-compatible.

So the problem of changing fundamentally how the code works, even for a cleanup, is that it will break someone's code out there because they depended on the module's crazy way of doing things. Now if they are cheating and looking at things that are meant to be hidden you might be able to clean things up, but if the semantics are exposed to the user, then there is not much we can do w/o breaking someone's code.

Honestly, if the code is as bad as it seems -- including its API --, the best bet would be to come up with a new module for handling MIME types from scratch, put it up on the Cheeseshop/PyPI, and get the community behind it. If the community picks it up as the de-facto replacement for mimetypes and the code has settled we can then talk about adding it to the standard library and begin deprecating mimetypes.

And thanks for willing to volunteer to fix this.

-Brett
_______________________________________________
Python-Dev mailing list
Python-Dev@xxxxxxxxxx
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/maillists%40codeha.us
Google Custom Search

News | Mail Home | sitemap | FAQ | advertise