osdir.com
mailing list archive

Subject: Re: Solutions and Discussion for Perl Quiz of the Week #17 - msg#00063

List: lang.perl.qotw.discuss

Date: Prev Next Index Thread: Prev Next Index
A full db would be overkill. All we need is an index:

#!/usr/bin/perl

use constant REC_SIZE => 11;

for (0 .. $ARGV[0]) {
print get_random_word('test'), "\n";
}

sub get_random_word {
my $dict = shift;
my $index = "$dict.idx";

#if the index doesn't exist then create it
#I could probably check the file dates and
#rebuild the index when the index is older
#than the dict, but I don't care that much
unless (-f $index) {
open DICT, '<', $dict
or die "Could not open $dict:$!";
open INDEX, '>', $index
or die "Could not open $index:$!";

#the first record must be written manually
print INDEX 0 x (REC_SIZE-1), "\n";

#this writes one too many records
#so we will have to ignore the it
#later when we get the random num
my $pattern = '%0.' . (REC_SIZE-1) . "d\n";
while (my $line = <DICT>) {
printf OUT $pattern, tell DICT;
}
close DICT;
close INDEX;
}

open INDEX, '<', $index
or die "Could not open $index:$!";

open DICT, '<', $dict
or die "Could not open $dict:$!";

#choose a random record in the index
seek (
INDEX,
REC_SIZE * int
rand((stat(INDEX))[7]/REC_SIZE-1),
0
);

#read the record to get the offset
my $word_pos = <INDEX>;
seek DICT, $word_pos, 0;
my $word = <DICT>;

close INDEX;
close DICT;

chomp($word);
return $word;
}



Was this page helpful?
Yes No
Thread at a glance:

Previous Message by Date: click to view message preview

Re: Solutions and Discussion for Perl Quiz of the Week #17

On Mon, Jun 07, 2004 at 10:43:03AM +0200, "Christian D?hl" wrote: > It's perhaps a little bit silly, but whats about putting the wordbook > into a database? Then we could easily get the number of stored > records, and with the help of an autoincrement primary key, we only > need to call rand once and we need to get exactly one "line" out of > our database. > > If the book isn't changing, this seems to be an idea!? If the book isn't changing, seems that a DB would be overkill - just record somewhere the number of entries, and be done with it. -- Dan Boger dan-rlx3YLNxYWXQT0dZR+AlfA@xxxxxxxxxxxxxxxx

Next Message by Date: click to view message preview

Re: [SPOILER] (but no code) - idea for representing tiles

Mark Jason Dominus writes: [...] > The really important thing is to glue two tiles together. [...] > Important rule here: the > only interesting attachments occur when the spliced list has a pair of > adjacent numbers that differ by 6. The second list contains 3 folowed > by 9. When this happens, it's because the two tiles are edge-to-edge. Does this imply that only one tile can be joined to the side of another tile? -K -- "I've had nothing yet," Alice replied in an offended tone, "so I can't take more." "You mean you can't take less," said the Hatter: "it's very easy to take more than nothing."

Previous Message by Thread: click to view message preview

Re: Solutions and Discussion for Perl Quiz of the Week #17

Ron Isaacson wrote:> Pat C. wrote:> >> > Ron Isaacson wrote:> >> > > BUT... you need to know in advance how long the shortest word in the> > > file is. So it fails the original intention, which is to pick a random> > > word efficiently given an arbitrary input file.> >> > Well, if you don't know how short the shortest word is, then you canjust> > assume it is one character and you won't go wrong. If you do know more> > about your dictionary file though, you can make it faster by specifyingthe> > minimum length.>> Actually, I don't think that's true. IIRC, your algorithm relies on> the minimum length to know if it landed in the middle of a word. If> $MIN_WORD_LEN is smaller than the actual minimum word length in the> file, it might return a partial word.No, no, it relies on the minimum word length to know how early in a line itcan seek without creating a bias towards longer words.  If the shortest wordin the file is 4 characters, then it can accept a seek as far as 4characters from the end of the word without creating a bias.  Once it hasfound an acceptable seek point, it returns the next line.  I don't thinkthere's any chance of returning a partial word.  Having $MIN_WORD_LENsmaller than the shortest word should do no harm other than running slightlyslower.-Pat C.

Next Message by Thread: click to view message preview

Wrap-up for word analyzer and guesser

I won't have a chance to work any more on the analyzer (patan) for the next day or so, and the list is moving on, so I thought I'd just wrap-up by jotting down some final thoughts for anyone who cares. After posting my last version of the pattern analyzer (well, sequence analyzer as it really just looks at sequences which is only a small subset of patterns), I reallized that it might increase accuracy further to localize the sequences within words. My thinking, and it needs measurements to confirm, is that certain sequences may be more likely to appear at the beginning, middle, or end of a word. To put it in terms of my chosen algorithm: the sequence "io", if at the end of a word, may indicate that a guess of "n" is most likely to succeed, whereas if the sequence "io" appeared at the beginning of a word some other letter /might/ be more likely. It's unclear without measurements how much of an improvement this would yield or how many divisions would yield the best improvement. The above change would likely neccessitate a change in data storage. Storable is convenient, but it is never the ideal storage mechanism as it is too general to be as compact as it could be if written for specific data. A dbm or custom data format are the possible alternatives. As an aside, I was attempting to use Data::Dumper to look at portions of the data I was gathering, but hit a wall... Out of Memory! It looks like Data::Dumper would have a method that took a file handle and printed as it went instead of returning a string of the entire structure at once. Patching Data::Dumper in this way would make an excellent expert challenge. Anyway, here is what I had intended for the word guessing algorithm: 1. Guess a letter based on frequency: A. If we do not know the length of the word, we guess using simple letter frequencies derived from the entire dictionary (eg. `patan --dict mydict --query -` returns this letter frequency table). B. If we do know the length of the word, read the wordlist, recording all words of that length. Then derive a simple letter frequency table from the filtered wordlist. 2. If the guess is wrong, record it so we can filter the wordlist and regenerate frequencies. (@filtered_wordlist = grep { $_ =~ /^[^badchars]+$/ } @wordlist) Return to step 1 until a correct guess is made. 3. Once we have at least one correct letter we can switch to analyzing sequences. For each correct letter or sequence of letters, look at the most probably letter to follow or preceed that letter or (localized) sequence. Compare probabilities for letters before or after /each/ sequence, not forgetting cases where a given letter is, say, 2nd most probable in coming before one sequence and, say, 3rd most probable in coming after a different sequence, making the given letter more probable in total than any other letter. Repeat 'till done. Randy.
Sign up for updates to this mailing list. email:
Loading Comments...
Home | News | Patents | Sitemap | FAQ | advertise

Advertising by