logo       
Google Custom Search
    AddThis Social Bookmark Button
-->

QOTW #14 ... yet more improvement: msg#00086

Subject: QOTW #14 ... yet more improvement
I may seek therapy if I am unable to stop my brain from going 
to this problem... 

Removed a test that became redundant some time ago, eliminated 
unnecessary looping and two unnecessary operations, and squoze 
an extra character advance in three places.  Basically, cleaned 
it up a lot.  For all I know it's just like Cees' now, I've 
made a point of not studying his code too hard until I'm done.


sub repeated_substring
{
  my $len = length($_[0]);

  my ($best, $bestlen) = ("", 0);

  my ($left, $leftlen, $rightpos, $skip, $optimizing, $half, $rem);
  my $leftstart = 0;
  while ($leftstart <= $len - (($bestlen+1) << 1))
  {
    $rightpos = $leftstart + ($leftlen = 1 + $bestlen);

    while ($skip = 1)        # Triggers syntax warning 
    {
      if ($optimizing)
      {
        # If last half of substring at current position
        # isn't anywhere in rest of string, we can skip
        # to 1 past its beginning
        $half = $leftlen >> 1;
        if (index($_[0], substr($_[0], $leftstart+$half + ($rem = $leftlen&1), 
$half),
                  $rightpos) < 0)
        {
          $skip = $half + 1 + $rem;
          last;
        }
      }
      $left = substr($_[0], $leftstart, $leftlen);

      last unless ($rightpos = index($_[0], $left, $rightpos)) >= 0;

      $best       = $left; 
      $bestlen    = $leftlen++;
      $optimizing = $bestlen > 16;

      $rightpos++ if $leftstart + $leftlen > $rightpos;  # Overlapping
    }
    $leftstart += $skip;
  }
  return $best;
}


=for comment

The algorithm is pretty ordinary.  We march along the string starting
from each character position in turn, generating successively longer
substrings from that point and iterating through any and all matches
for each substring occurring in the main string after the end of the
substring.  

Everything else is just optimization; we don't continue searching in
the string past a point where no repeated substring could be longer 
than the best one found so far.  We start the candidate substring
out at a length equal to the length of the best one found so far,
plus one.  We don't calculate that substring if we're optimizing and
don't need it.

The main optimization is executed if $optimizing is true, which occurs
when the length of the current best match is more than 16 characters.
I chose this value after reading Devel::SmallProf dumps of the
unoptimized version and seeing how long various operations were
taking, then trying some candidates.  The optimization is that if the
last half of the candidate substring at the current position is not
found in the rest of the string, then we can skip forward to one
character past the beginning of that substring.  The longer the best
match found so far, the faster this works.

Suppose the current best match is of length 17, then we are testing
candidate solutions of length 18.  Here our current left starting
position is #1:

|_________ABCDEFGHIJKLMNOPQR____________________|
          1         2       3

If 'JKLMNOPQR' is not found in the string starting at #3, we can
advance the left starting position to #2.

I decided not to try further division on the grounds of diminishing
returns; the more tests that are added in the loop, the more the time
will increase for cases where the optimization isn't effective.

=cut




<Prev in Thread] Current Thread [Next in Thread>