logo       
Google Custom Search
    AddThis Social Bookmark Button

Re: [SPOILER] Algorithmic approach to Expert QOTW 14: msg#00097

Subject: Re: [SPOILER] Algorithmic approach to Expert QOTW 14
In article 
<Pine.SOL.3.91.1030616013253.29680A-100000-hg4xrdQvNp0Y/BO0ai66Mjv7f71NJ7xlN7jzCyCsa88@xxxxxxxxxxxxxxxx>,
        jmv16-HmMyXyqgL2CVc3sceRu5cw@xxxxxxxxxxxxxxxx writes:
> So, this repeated-substring function is a known problem in the computer
> science world (I learned about it undergrad, anyway).  There's a
> linear-time linear-space algorithm using suffix trees, for example.
> 
Unfortunately a standard suffix tree with standard lookup gives you 
repeated string *with* overlaps allowed

> I just ported some suffix tree code I found on the web from (get this) 
> JavaScript to Perl.  It works just fine (e.g., Constitution in 8 seconds 

Yup, I went through the same exercise. Athlon XP1800+, constitution in
2 seconds, shandy in 80 seconds.

> on PIII 850 MHz, and the code is horribly unoptimized), but I can't run 
> it on longer inputs because it's eating memory like crazy.
> 
Absolutely. Perl is just horribly memory inefficient. You construct
complex datastructures at about one per character. It seems to use
about 400 times as much memory as the size of the input text.

> I suspect this is because the reference-counting garbage collector is not 
> intelligent enough to reclaim everything it should.  But going through 

Probably not actually. The backlinks indeed cause circular references, so
the structure doesn't get cleaned up. This is a problem if you construct
multiple suffix trees in one run, but not otherwise.

The backlinks are only one entry for each internal node to a node that's
kept alive by the root anyways. So cutting the links won't gain you memory.

> and breaking the links manually is going to be a mess (the data structure 
> is a sort of tree with pointers from children to parents, so I'd have to 
> grovel through it all), and so I'd like to confirm what's going on before 

You can actually use weak references for the backpointers. This way perl
will take care of it for you (and indeed using weak references doesn't
gain you anything on tree construction)

> spending on more time on that aspect.  So I guess what I'm asking is:
> 
> 
>     Is there any way to profile or otherwise analyze memory usage?
> 
> 
> (Anybody who wants the code is welcome to it, but it's quite messy at 
> present and not going to make sense unless you've worked through the 
> pseudocode at http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/
> or a similar place beforehand.)
> 

Yup, I used the same source. I'll attach my code for comparision.

> 
> On a side note, has anybody tried to link the BDW collector with Perl
> instead of this reference counting nonsense?  Boy, would that make my day. 
> 
I don't think so in fact. I predict it would basically run in the same time

#! /usr/bin/perl -w
use strict;
use Scalar::Util qw(weaken);

# use lib "/home/ton/lib";
# use Debug qw(debug);
# use Data::Dumper;

use constant INFINITY => 1e5000;
# We (ab)use key "" to represent the suffix links
use constant SLINK    => "";

my ($s, $k, $txt, $root);

# For debugging. Nicely draw the suffix tree
sub DrawTree {
    my ($s, $prefix) = @_;
    print "-->", $s->{+SLINK} ? "+" : "*";
    $prefix .= "   ";
    my @keys = sort keys %$s;
    shift @keys if $keys[0] eq SLINK;
    for (0..$#keys) {
        print $prefix,"|\n" if $_;
        print $prefix,"+" if $_;
        my ($k1, $l1, $s1) = @{$s->{$keys[$_]}};
        if ($l1 == INFINITY) {
            print "---", substr($txt, $k1), "\n";
        } else {
            print "---", substr($txt, $k1, $l1);
            DrawTree($s1,$prefix . ($_ == $#keys ? " " : "|") . " " x (3+$l1));
        }
    }
}

sub Update {
    my $i = shift;
    # (s, (k, i-1)) is the canonical reference pair for the active point
    my $old_root = $root;
    my $chr = substr($txt, $i, 1);
    while (my $r  = TestAndSplit($i-$k, $chr)) {
        $r->{$chr} = [$i, INFINITY, {}];
        # build suffix-link active-path
        weaken($old_root->{+SLINK} = $r) if $old_root != $root;
        $old_root = $r;
        $s = $s->{+SLINK};
        Canonize($i-$k);
    }

    weaken($old_root->{+SLINK} = $s) if $old_root != $root;
}

sub TestAndSplit {
    my ($l, $t) = @_;
    return !$s->{$t} && $s unless $l;
    my ($k1, $l1, $s1)  = @{$s->{substr($txt, $k, 1)}};

    my $try = substr($txt, $k1 + $l, 1);
    return if $t eq $try;
    # s---->r---->s1
    my %r = ($try => [$k1 +$l, $l1-$l, $s1]);
    $s->{substr($txt, $k1, 1)} = [$k1, $l, \%r];
    return \%r;
}

sub Canonize {
    # s--->...
    my $l = shift || return;

    # find the t_k transition g'(s,(k1,l1))=s' from s
    my ($k1, $l1, $s1) = @{$s->{substr($txt, $k, 1)}};
    # s--(k1,l1)-->s1
    while ($l1 <= $l) {
        # s--(k1,l1)-->s1--->...
        $k += $l1;  # remove |(k1,l1)| chars from front of (k,l)
        $l -= $l1;
        $s  = $s1;
        # s--(k1,l1)-->s1
        ($k1, $l1, $s1) = @{$s->{substr($txt, $k, 1)}} if $l;
    }
}

# construct suffix tree for $txt[0..N-1]
sub BuildTree {
    $txt = shift();
    # bottom or _|_
    my %bottom;
    $root = {SLINK() => \%bottom};
    # Create edges for all possible chars from bottom to root
    $bottom{substr($txt, $_, 1)} ||= [$_, 1, $root] for 0..length($txt)-1;

    $s = $root;
    $k = 0;
    for (0..length($txt)-1) {
        # follow path from active-point
        Update($_);
        Canonize($_-$k+1);
    }
    delete $root->{+SLINK};
    return $root;
}

my $best;
sub lrs {
    my ($s, $depth) = @_;
    # Skip leaves
    return unless %$s;
    my $edges;
    for (keys %$s) {
        next if $_ eq SLINK;
        my ($k, $l, $node) = @{$s->{$_}};
        if (my $new_edges = lrs($node, $depth+$l)) {
            push(@$new_edges, $k, $l);
            $edges = $new_edges;
        }
    }
    return $edges if $best >= $depth;
    $best = $depth;
    return [0, 0];
}

sub LongestRepeatedSubstring {
    my $tree = shift;
    $best = -1;
    my $edges = lrs($tree, 0);
    my $string = "";
    while (@$edges) {
        my $l = pop @$edges;
        my $k = pop @$edges;
        $string .= substr($txt, $k, $l);
    }
    return $string;
}

# debug(qw[Update TestAndSplit Canonize(d)]);
my $tree;
# Three ways of calling:
if (@ARGV) {
    # Use $ as terminator char. Input text should NOT contain it.
    if ($ARGV[0] =~ m![./]!) {
        # If the argument contains a . or /, guess it's a file
        open(my $fh, "<", $ARGV[0]) || die "Could not open $ARGV[0]: $!";
        my $text = do { local $/; <$fh> . "\$" };
        $tree = BuildTree($text . "\$");
        print "LRS=", LongestRepeatedSubstring($tree), "\n";
    } else {
        # Otherwise, assume the argument is a string
        $tree = BuildTree(shift() . "\$");
        print "LRS=", LongestRepeatedSubstring($tree), "\n";
        DrawTree($tree);
    }
} else {
    # Default string for basic sanity checking and demo
    $tree = BuildTree("xabxac");
    DrawTree($tree);
}



<Prev in Thread] Current Thread [Next in Thread>