logo       

[SPOILER] sorted thoughts: msg#00102

Subject: [SPOILER] sorted thoughts
Looking through the solutions posted, it's interesting that most everyone is preprocessing the name list before sorting. I had a thought that it might be faster to only translate or lookup the characters that needed to be compared. For example, if you're comparing 'Squid' with 'Squamuglia', it is only neccessary to translate the first four characters (8 total lookups) of each word. Whereas if you preprocess the the names you end up doing 15 total lookups. The majority of cases /should/ show a similar pattern.

To test this, I constructed a small model to measure operations: lookups and comparisons. It's attached as 16-inst.pl. The instrumented model doesn't actually do anything except go through the motions, taking counts of some operations. For the small sample of names included the results look something like:

PROCESS INLINE:
  number of words: 13
  words compared: 35
  chars compared: 100
  lookups: 200

PREPROCESS LIST:
  number of words: 13
  words compared: 35
  chars compared: 100
  lookups: 338

A larger sample produces an even greater savings in lookups.

To take a closer look, I threw together a sample sorting routine, attached as 16-sort.pl. It probably seems a bit verbose, but this is mainly because it is neccessary to perform some operations explicitly whereas the other solutions allow perl to do the operations under the hood.

There are a couple of issues I still have with this implementation. One, as mentioned, it was neccessary to perform explicitly-in perl code-what the other solutions get implicitly through perls built-ins which are heavily optimized C code. I doubt my iteration over each char for comparison comes anywhere close to the optimized C code in perl's sort routine. Two, each name will be compared multiple times, meaning the same translation/lookup will be performed multiple times for each word. Caching /might/ help.

Just some thoughts,
Randy.
#!/usr/bin/perl

use strict;
use warnings;

use constant PERMUTED_ALPHABET => 'QWERTYUIOPASDFGHJKLZXCVBNM';

our( $num_lookups, $num_word_cmps, $num_char_cmps );

sub i_tr {
    my( $str, $alpha ) = @_;
    $num_lookups += length( $alpha );
    return $str;
}

sub i_sort {
    my( $a, $b ) = @_;

    ++$num_word_cmps;

    my $a_len = length($a);
    my $b_len = length($b);
    my $min_len = $a_len < $b_len ? $a_len : $b_len;

    for my $i ( 0 .. ($min_len-1) ) {
        my $a_ch = substr( $a, $i, 1 );
        my $b_ch = substr( $b, $i, 1 );

        my $rel = $a_ch cmp $b_ch;
        ++$num_char_cmps;

        next if ($rel == 0);
        return $rel;
    }

    return -1 if ($a_len  < $b_len);
    return  0 if ($a_len == $b_len);
    return  1; # if ($a_len >  $b_len);
}

sub i_tr_sort {
    my( $a, $b ) = @_;

    ++$num_word_cmps;

    my $a_len = length($a);
    my $b_len = length($b);
    my $min_len = $a_len < $b_len ? $a_len : $b_len;

    for my $i ( 0 .. ($min_len-1) ) {
        my $a_ch = substr( $a, $i, 1 );
        my $b_ch = substr( $b, $i, 1 );

        # perform lookups
        $num_lookups += 2;

        my $rel = $a_ch cmp $b_ch;
        ++$num_char_cmps;

        next if ($rel == 0);
        return $rel;
    }

    return -1 if ($a_len  < $b_len);
    return  0 if ($a_len == $b_len);
    return  1; # if ($a_len >  $b_len);
}

############################################################
#
# BEGIN

# init

chomp( my @data = <DATA> );
#  -or-
# open( FH, 'words' ) or die;
# chomp( my @data = <FH> );
# close( FH );


##############################
#
# 1st run: Process Inline

# @sorted =
    map { $_->[1] }
    sort { i_tr_sort( $a->[0], $b->[0] ) }
    map { /([\w\-\']+)[\W\-\']+(.*)/; [ "$2 $1", $_ ] }
    @data;

print <<"EOM";
PROCESS INLINE:
  number of words: @{[scalar @data]}
  words compared: $num_word_cmps
  chars compared: $num_char_cmps
  lookups: $num_lookups

EOM

##############################


# reset counters

( $num_lookups, $num_word_cmps, $num_char_cmps ) = ( 0, 0, 0 );


##############################
#
# 2nd run: Pre-process

# @sorted =
    map { $_->[1] }
    sort { i_sort( $a->[0], $b->[0] ) }
    map { i_tr( $_, PERMUTED_ALPHABET ) }
    map { /([\w\-\']+)[\W\-\']+(.*)/; [ "$2 $1", $_ ] }
    @data;

print <<"EOM";
PREPROCESS LIST:
  number of words: @{[scalar @data]}
  words compared: $num_word_cmps
  chars compared: $num_char_cmps
  lookups: $num_lookups

EOM

##############################

__DATA__
Jill Harmon
Walter Reston
Norma Kretschmer
Gus Tribble
Fiorella Squamuglia
Marv Smith
Marv Smithy
Mark Smith
Marvin Smith
Bill Smith
Ed Squid
John Macdonald
Angus MacDonald
#!/usr/perl

use strict;
use warnings;

use Data::Dumper;

# randomize
# for each elem, switch with another random elem.

my $perm = 'QWERTYUIOPASDFGHJKLZXCVBNM';

my @perm;
for my $i ( 0 .. (length($perm) - 1) ) {
    my $ch = substr( $perm, $i, 1 );
    my ($u, $l) = (uc($ch), lc($ch));
    $perm[ord($u)] = $i + 256; # higher than the ord of any char we will
    $perm[ord($l)] = $i + 256; # encounter, assuming ASCII. I.E. sort
                               # punctuation first.
}

sub i_sort {
    my( $a, $b ) = @_;
# print "'$a' cmp '$b'\n";

    my $a_len = length($a);
    my $b_len = length($b);
    my $min_len = $a_len < $b_len ? $a_len : $b_len;

    for my $i ( 0 .. ($min_len-1) ) {

        my $a_ch = substr( $a, $i, 1 );
# print "'$a_ch':";
        my $a_tr = $perm[ord($a_ch)];
        $a_ch = defined($a_tr) ? $a_tr : ord($a_ch);
# print "$a_ch <=> ";

        my $b_ch = substr( $b, $i, 1 );
# print "'$b_ch':";
        my $b_tr = $perm[ord($b_ch)];
        $b_ch = defined($b_tr) ? $b_tr : ord($b_ch);
# print "$b_ch => ";

        my $rel = $a_ch <=> $b_ch;
# print "$rel\n";
        next if ($rel == 0);
        return $rel;
    }

    return -1 if ($a_len  < $b_len);
    return  0 if ($a_len == $b_len);
    return  1; # if ($a_len >  $b_len);
}

chomp( my @data = <DATA> );

my @names =
    map { $_->[1] }
    sort { i_sort( $a->[0], $b->[0] ) }
    map { /([\w\-\']+)[\W\-\']+(.*)/; [ "$2 $1", $_ ] }
    @data;

print join "\n", @names;

__DATA__
Jill Harmon
Walter Reston
Norma Kretschmer
Gus Tribble
Fiorella Squamuglia
Marv Smith
Marv Smithy
Mark Smith
Marvin Smith
Bill Smith
Ed Squid
John Macdonald
Angus MacDonald
<Prev in Thread] Current Thread [Next in Thread>
Google Custom Search

Recently Viewed:
linux.arklinux....    user-groups.lin...    kde.usability/2...    ietf.ipp/2002-0...    mail.spam.spamc...    os.netbsd.devel...    audio.cd-record...    text.unicode.de...    php.documentati...    games.fps.halfl...    window-managers...    suse.oracle.gen...    bug-tracking.gn...    video.dvdrip.us...    xfree86.cvs/200...    java.netbeans.m...    network.argus/2...    culture.sf.kill...    debian.ports.al...    freebsd.questio...    qplus.devel/200...    handhelds.palm....   
Home | blog view | USPTO Patent Archive | advertise | OSDir is an inevitable website. super tiny logo

Free Magazines

Cisco News
Receive a free quarterly e-newsletter with exclusive articles on how Cisco IT uses its own products and solutions to enable the business.
subscribe

Systems Management News, the newspaper for IT systems administration and data center managers! Each issue of Systems Management News is chock-full of news and analysis to help you understand what's happening in your field.
subscribe

The Enterprise Newsweekly eWeek is the essential technology information source for builders of e-business.
subscribe

Oracle Magazine Oracle Magazine contains technology strategy articles, sample code, tips, Oracle and partner news, how to articles for developers and DBAs, and more. Oracle (NASDAQ: ORCL) is the world's largest enterprise software company.
subscribe

Total Telecom Total Telecom is "The Economist of the communications industry".
subscribe