Looking through the solutions posted, it's interesting that most
everyone is preprocessing the name list before sorting. I had a thought
that it might be faster to only translate or lookup the characters that
needed to be compared. For example, if you're comparing 'Squid' with
'Squamuglia', it is only neccessary to translate the first four
characters (8 total lookups) of each word. Whereas if you preprocess the
the names you end up doing 15 total lookups. The majority of cases
/should/ show a similar pattern.
To test this, I constructed a small model to measure operations: lookups
and comparisons. It's attached as 16-inst.pl. The instrumented model
doesn't actually do anything except go through the motions, taking
counts of some operations. For the small sample of names included the
results look something like:
PROCESS INLINE:
number of words: 13
words compared: 35
chars compared: 100
lookups: 200
PREPROCESS LIST:
number of words: 13
words compared: 35
chars compared: 100
lookups: 338
A larger sample produces an even greater savings in lookups.
To take a closer look, I threw together a sample sorting routine,
attached as 16-sort.pl. It probably seems a bit verbose, but this is
mainly because it is neccessary to perform some operations explicitly
whereas the other solutions allow perl to do the operations under the hood.
There are a couple of issues I still have with this implementation. One,
as mentioned, it was neccessary to perform explicitly-in perl code-what
the other solutions get implicitly through perls built-ins which are
heavily optimized C code. I doubt my iteration over each char for
comparison comes anywhere close to the optimized C code in perl's sort
routine. Two, each name will be compared multiple times, meaning the
same translation/lookup will be performed multiple times for each word.
Caching /might/ help.
Just some thoughts,
Randy.
#!/usr/bin/perl
use strict;
use warnings;
use constant PERMUTED_ALPHABET => 'QWERTYUIOPASDFGHJKLZXCVBNM';
our( $num_lookups, $num_word_cmps, $num_char_cmps );
sub i_tr {
my( $str, $alpha ) = @_;
$num_lookups += length( $alpha );
return $str;
}
sub i_sort {
my( $a, $b ) = @_;
++$num_word_cmps;
my $a_len = length($a);
my $b_len = length($b);
my $min_len = $a_len < $b_len ? $a_len : $b_len;
for my $i ( 0 .. ($min_len-1) ) {
my $a_ch = substr( $a, $i, 1 );
my $b_ch = substr( $b, $i, 1 );
my $rel = $a_ch cmp $b_ch;
++$num_char_cmps;
next if ($rel == 0);
return $rel;
}
return -1 if ($a_len < $b_len);
return 0 if ($a_len == $b_len);
return 1; # if ($a_len > $b_len);
}
sub i_tr_sort {
my( $a, $b ) = @_;
++$num_word_cmps;
my $a_len = length($a);
my $b_len = length($b);
my $min_len = $a_len < $b_len ? $a_len : $b_len;
for my $i ( 0 .. ($min_len-1) ) {
my $a_ch = substr( $a, $i, 1 );
my $b_ch = substr( $b, $i, 1 );
# perform lookups
$num_lookups += 2;
my $rel = $a_ch cmp $b_ch;
++$num_char_cmps;
next if ($rel == 0);
return $rel;
}
return -1 if ($a_len < $b_len);
return 0 if ($a_len == $b_len);
return 1; # if ($a_len > $b_len);
}
############################################################
#
# BEGIN
# init
chomp( my @data = <DATA> );
# -or-
# open( FH, 'words' ) or die;
# chomp( my @data = <FH> );
# close( FH );
##############################
#
# 1st run: Process Inline
# @sorted =
map { $_->[1] }
sort { i_tr_sort( $a->[0], $b->[0] ) }
map { /([\w\-\']+)[\W\-\']+(.*)/; [ "$2 $1", $_ ] }
@data;
print <<"EOM";
PROCESS INLINE:
number of words: @{[scalar @data]}
words compared: $num_word_cmps
chars compared: $num_char_cmps
lookups: $num_lookups
EOM
##############################
# reset counters
( $num_lookups, $num_word_cmps, $num_char_cmps ) = ( 0, 0, 0 );
##############################
#
# 2nd run: Pre-process
# @sorted =
map { $_->[1] }
sort { i_sort( $a->[0], $b->[0] ) }
map { i_tr( $_, PERMUTED_ALPHABET ) }
map { /([\w\-\']+)[\W\-\']+(.*)/; [ "$2 $1", $_ ] }
@data;
print <<"EOM";
PREPROCESS LIST:
number of words: @{[scalar @data]}
words compared: $num_word_cmps
chars compared: $num_char_cmps
lookups: $num_lookups
EOM
##############################
__DATA__
Jill Harmon
Walter Reston
Norma Kretschmer
Gus Tribble
Fiorella Squamuglia
Marv Smith
Marv Smithy
Mark Smith
Marvin Smith
Bill Smith
Ed Squid
John Macdonald
Angus MacDonald
#!/usr/perl
use strict;
use warnings;
use Data::Dumper;
# randomize
# for each elem, switch with another random elem.
my $perm = 'QWERTYUIOPASDFGHJKLZXCVBNM';
my @perm;
for my $i ( 0 .. (length($perm) - 1) ) {
my $ch = substr( $perm, $i, 1 );
my ($u, $l) = (uc($ch), lc($ch));
$perm[ord($u)] = $i + 256; # higher than the ord of any char we will
$perm[ord($l)] = $i + 256; # encounter, assuming ASCII. I.E. sort
# punctuation first.
}
sub i_sort {
my( $a, $b ) = @_;
# print "'$a' cmp '$b'\n";
my $a_len = length($a);
my $b_len = length($b);
my $min_len = $a_len < $b_len ? $a_len : $b_len;
for my $i ( 0 .. ($min_len-1) ) {
my $a_ch = substr( $a, $i, 1 );
# print "'$a_ch':";
my $a_tr = $perm[ord($a_ch)];
$a_ch = defined($a_tr) ? $a_tr : ord($a_ch);
# print "$a_ch <=> ";
my $b_ch = substr( $b, $i, 1 );
# print "'$b_ch':";
my $b_tr = $perm[ord($b_ch)];
$b_ch = defined($b_tr) ? $b_tr : ord($b_ch);
# print "$b_ch => ";
my $rel = $a_ch <=> $b_ch;
# print "$rel\n";
next if ($rel == 0);
return $rel;
}
return -1 if ($a_len < $b_len);
return 0 if ($a_len == $b_len);
return 1; # if ($a_len > $b_len);
}
chomp( my @data = <DATA> );
my @names =
map { $_->[1] }
sort { i_sort( $a->[0], $b->[0] ) }
map { /([\w\-\']+)[\W\-\']+(.*)/; [ "$2 $1", $_ ] }
@data;
print join "\n", @names;
__DATA__
Jill Harmon
Walter Reston
Norma Kretschmer
Gus Tribble
Fiorella Squamuglia
Marv Smith
Marv Smithy
Mark Smith
Marvin Smith
Bill Smith
Ed Squid
John Macdonald
Angus MacDonald
|