logo       
Google Custom Search
    AddThis Social Bookmark Button
-->

Re: [SPOILER] Algorithmic approach to Expert QOTW 14: msg#00130

Subject: Re: [SPOILER] Algorithmic approach to Expert QOTW 14
On Tue, 17 Jun 2003, Ton Hospel 
<qotwdiscuss-rAR/wO4k2tt4hMtRath00g@xxxxxxxxxxxxxxxx> wrote:

> I massaged the suffix trees a bit to make them into a real solution to 
> the problem: strings without overlap. I'm however not 100% sure that the
> code i use to avoid overlap is always right.

The opposite of "I have only proved it correct, not tried it," eh?

I just looked through your variation, and it is essentially the same one I
came up with in the shower the other morning (after wrestling with it for
several days, from the moment whoever-on-this-list pointed out that the
usual suffix tree algorithm fails to account for overlaps).  In other
words, I believe the approach you use is correct.  The implementation
looks good as well, although I certainly could have missed an off-by-one
error or something. 

As for your originality hypothesis, I'm not sure.  There was a passing
mention in some notes on the web that repeated-substring-without-overlap
required augmenting the tree -- and no further details.  I did a fairly
exhaustive search of the literature, and didn't find anything.  So I'm not
sure whether that webpage was in error, or all the algorithmicists think
it's obvious and just didn't bother to publish it, or I was looking in the
wrong place, or it's really a new idea.  However, the fact that you and I
both came up with the same implementation (it wasn't *that* tricky, once I
faced the fact that I'd actually have to think it up from scratch rather
than find it in a textbook) argues that it's not brand new.

But regardless, thanks for taking the trouble to code it up, I wasn't
going to have time until after QOTW 14 was old news, and I really wanted
to see the results.  By the way, I'm impressed with the elegance of your
suffix tree port (and extension); mine was way too ugly to share.


Out of curiousity, do you have any (approximate, since it varies
enormously on the nature of the input) figures for memory usage as a
function of input length?  I'm curious whether my implementation was
actually leaking memory (due to reference counting), or just using lots of
it by being O(n) with a large constant due to heavy datastructures in
Perl.  (Presumably once the machine starts swapping, the lighter
algorithms pull ahead again -- although in theory they should lose out
eventually.)


P.S.  I get the impression that the computational biologists may have
      extremely well-optimized algorithms for approximating this problem, 
      but I did not attempt to wade through all that literature. 


--
Jeffrey M. Vinocur   *   jmv16-HmMyXyqgL2CVc3sceRu5cw@xxxxxxxxxxxxxxxx
http://www.people.cornell.edu/pages/jmv16/



<Prev in Thread] Current Thread [Next in Thread>