[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Python-Dev] Issue?


I wrote Python's sort, so I may know something about it ;-)  To my
eyes, no, there's not "an issue" here, but a full explanation would be
very involved.

For some sorting algorithms, it's possible to guarantee a redundant
comparison is never made.  For example, a pure insertion sort.

But Python's sort is quite complex, using a number of distinct
strategies dynamically adapting to structure found in the data as it
goes along.  There's a possibility for a small bit of redundant work
whenever switching from one strategy to another.  That's typical of
"hybrid" strategies.

Your case involves a switch between Python's sort's two simplest
strategies.  Let's just look at what matters:

[22, 33, 11, 11]

The first step is searching for "a natural run", an initial
sub-sequence that's already sorted.  22 < 33 says the first two
elements are already in order.  Then 11 < 33 says that's the end of
the run.

That part is over now.

The next step is using a binary insertion sort to move 11 into the
already-sorted [22, 33] prefix.  A binary search starts looking "in
the middle" first, but a 2-element run is so short that 33 "is the
middle", so it first compares 11 to 33 _again_.  So it goes.  Code to
detect that special case could be added, but it would probably cost
more than it saves (it always costs time to test for it, but that time
is pure waste unless the tests succeed).

This is specific to an ascending natural run of length 2, so is quite
a special case.  For example, in

[22, 33, 44, 11, 11]

11 < 44 ends the natural run, and then the binary search first
compares 11 to 33 ("the middle" of the 3-element natural run), and no
redundant comparison gets made.  But in

[22, 33, 44. 43, 11]

43 gets compared to 44 again on the _second_ iteration of the binary
search loop.

In general, this particular strategy switch ends up doing at most one
redundant comparison only if the first element after an ascending
natural run belongs immediately before the last element of the run.
For technical reasons, this can happen at most

    min(len(the_list) / 32, number_of_natural_runs_in_the_list)

times, so it's generally trivial in an average-case O(N log N)
process.  It's so rare it would be minor in an O(N) process too,
unless N is very small.

A principled way to avoid this would be to skip the "find a run" step
if N is very small, leaving the whole thing to binary insertion sort.
Then a redundant comparison would never be done for small N.  But that
would end up doing more comparisons _overall_ in common cases where a
short list starts with a relatively (compared to N) sizable ascending
or descending run.

So I'm happy enough with the tradeoffs already in place.


On Wed, Aug 22, 2018 at 10:37 AM ??? <1520306395 at qq.com> wrote:
> Why compare twice?
>
> class Student(object):
>
>     def __init__(self, name, score):
>         self.name = name
>         self.score = score
>
>     def __str__(self):
>         return '(%s: %s)' % (self.name, self.score)
>
>     __repr__ = __str__
>
>     def __lt__(self, s):
>         #print(self, '--', s)
>         if(self.score<s.score):
>             print(self, '<', s)
>             return True
>         if(self.score>s.score):
>             print(self, '>', s)
>             return False
>         if(self.score==s.score):
>             if(self.name>s.name):
>                 print(self, '>', s)
>                 return False
>             if(self.name<s.name):
>                 print(self, '<', s)
>                 return True
>             if(self.name==s.name):
>                 print(self, '==', s)
>                 return False
>
>     def __eq__(self, s):
>         return (self.score == s.score) and (self.name == s.name)
>     def __gt__(self, s):
>         return not ((self == s) or (self < s))
>     def __le__(self, s):
>         return ((self == s) or (self < s))
>     def __ge__(self, s):
>         return ((self == s) or (self > s))
>     def __nq__(self, s):
>         return not (self == s)
>
> L = [Student('Tim', 22), Student('Bob', 33), Student('Kevin', 11), Student('Alice', 11)]
> print(sorted(L))
>
> Output:
> (Bob: 33) > (Tim: 22)
> (Kevin: 11) < (Bob: 33)
> (Kevin: 11) < (Bob: 33)
> (Kevin: 11) < (Tim: 22)
> (Alice: 11) < (Tim: 22)
> (Alice: 11) < (Kevin: 11)
> [(Alice: 11), (Kevin: 11), (Tim: 22), (Bob: 33)]
>
> Best regards,
> Xiaofeng
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/tim.peters%40gmail.com