logo       

Re: [QUIZ] Perl 'Easy' Quiz of the Week #2005-1 (benchmarking: sorry): msg#00055

lang.perl.qotw.discuss

Subject: Re: [QUIZ] Perl 'Easy' Quiz of the Week #2005-1 (benchmarking: sorry)

<Oops, I've borked badly tonight. This is the text I wanted to send to the
list>

Benchmarking?

Given that the example data look very much like (obfuscated) stock
tickers (or rather, given the "extra points" question, tickers for some sort
of more exotic financial instrument), a discussion of speed is no doubt
otiose (but it was a rainy Sunday afternoon here):
1) There is no way that a file of such tickers could be considered 'big' by
today's memory standards.
2) Adding an extra suffix would only be necessary at the time of a major
market shake-up - say, at most, every couple of years or so - which
means that it doesn't matter a damn whether the program takes a few
milliseconds or a minute or so to run...

Nevertheless, the spec does say that "the file may be entirely too big to
fit into available memory", which means _big_. And this in turn suggests
that the speed of parsing each individual line might become a crucial
issue, especially if the code needed to be run frequently.

So, for testing purposes, I created a file with 10 million lines each
matching the regular expression /^[A-Z]{1,4}\.[A-Z]$/ (like the example
dataset):

open my $fh, '>', 'textdata.txt' or die "Can't open: $!";
for ( 1 .. 10_000_000 ) {
for ( 0 .. int rand 4 ) {
print $fh chr( ( int rand 26 ) + 65 );
}
print $fh '.', chr( ( int rand 26 ) + 65 ), "\n";
} # Coffee break (well, at least on my machine :)

The most obvious ways that spring to mind of parsing a line like
'ALDA.D' into a 'prefix' and a 'suffix' are a capturing regex and split.
Indeed, all of the posted solutions that I have seen* to date appear to
use one or other of these.
* I do not open email attachments from people I do not know and trust.

Capturing regex (typical implementation):

my $start = time;
while ( <> ) {
my ( $pref, $suff ) = /(\w+)\.([A-Z])/;
}
my $delta = time - $start;
print $delta;

OK, there's no error checking, and the $pref and $suff variables are
thrown away, but that's to keep the comparisions pure.

On my machine, run a few times, this takes an average of 47 seconds
for 10 million lines.

This can be (very) slightly improved by compiling the regex just once
(replace the body of the while loop with the following):

my ( $pref, $suff ) = /(\w+)\.([A-Z])/o;

Average time: 45 seconds

Next, as Brad Greenlee points out, split is faster than a capturing regex:

my ( $pref, $suff ) = split /\./;

Average time: 33 seconds.

However, split still uses the regex engine, and a totally non-regex
solution is much faster:

my $pref = substr $_, 0, ( index $_, '.' );
my $suff = substr $_, ( index $_, '.' ) + 1, 1 ;

Average time: 15 seconds (more than twice as fast as split, and more
than three times faster than a capturing regex).

Slightly faster still, though, at 13 seconds, is 'chomp-chop-chop':

chomp;
my $suff = chop;
chop ( my $pref = $_ );

That said, nothing in the spec indicates that the maximum number of
characters in the prefix should be four. So I created another test datafile,
where the maximum prefix length is 20 (still 10 million lines). Crucially,
chomp-chop-chop takes exactly the same amount of time to process
this file, while all the other methods take somewhat longer:

chomp-chop-chop: 13 (vs 13 for max. prefix length of 4)
substr: 16.5 (vs 15)
split: 39 (vs 33)
regex/o: 48 (vs 45)
regex: 51 (vs 47)

Any improvements on chomp-chop-chop?
Anyone with a more powerful box (or more time) care to test on really
long lines?

--
otiose dave







<Prev in Thread] Current Thread [Next in Thread>
Google Custom Search

News | FAQ | advertise