|
|
Subject: Re: Solutions and Discussion for Perl Quiz of the Week #17 - msg#00063
List: lang.perl.qotw.discuss
A full db would be overkill. All we need is an index:
#!/usr/bin/perl
use constant REC_SIZE => 11;
for (0 .. $ARGV[0]) {
print get_random_word('test'), "\n";
}
sub get_random_word {
my $dict = shift;
my $index = "$dict.idx";
#if the index doesn't exist then create it
#I could probably check the file dates and
#rebuild the index when the index is older
#than the dict, but I don't care that much
unless (-f $index) {
open DICT, '<', $dict
or die "Could not open $dict:$!";
open INDEX, '>', $index
or die "Could not open $index:$!";
#the first record must be written manually
print INDEX 0 x (REC_SIZE-1), "\n";
#this writes one too many records
#so we will have to ignore the it
#later when we get the random num
my $pattern = '%0.' . (REC_SIZE-1) . "d\n";
while (my $line = <DICT>) {
printf OUT $pattern, tell DICT;
}
close DICT;
close INDEX;
}
open INDEX, '<', $index
or die "Could not open $index:$!";
open DICT, '<', $dict
or die "Could not open $dict:$!";
#choose a random record in the index
seek (
INDEX,
REC_SIZE * int
rand((stat(INDEX))[7]/REC_SIZE-1),
0
);
#read the record to get the offset
my $word_pos = <INDEX>;
seek DICT, $word_pos, 0;
my $word = <DICT>;
close INDEX;
close DICT;
chomp($word);
return $word;
}
Was this page helpful?
Thread at a glance:
Previous Message by Date:
click to view message preview
Re: Solutions and Discussion for Perl Quiz of the Week #17
On Mon, Jun 07, 2004 at 10:43:03AM +0200, "Christian D?hl" wrote:
> It's perhaps a little bit silly, but whats about putting the wordbook
> into a database? Then we could easily get the number of stored
> records, and with the help of an autoincrement primary key, we only
> need to call rand once and we need to get exactly one "line" out of
> our database.
>
> If the book isn't changing, this seems to be an idea!?
If the book isn't changing, seems that a DB would be overkill - just
record somewhere the number of entries, and be done with it.
--
Dan Boger
dan-rlx3YLNxYWXQT0dZR+AlfA@xxxxxxxxxxxxxxxx
Next Message by Date:
click to view message preview
Re: [SPOILER] (but no code) - idea for representing tiles
Mark Jason Dominus writes:
[...]
> The really important thing is to glue two tiles together.
[...]
> Important rule here: the
> only interesting attachments occur when the spliced list has a pair of
> adjacent numbers that differ by 6. The second list contains 3 folowed
> by 9. When this happens, it's because the two tiles are edge-to-edge.
Does this imply that only one tile can be joined to the side of another
tile?
-K
--
"I've had nothing yet," Alice replied in an offended tone,
"so I can't take more."
"You mean you can't take less," said the Hatter:
"it's very easy to take more than nothing."
Previous Message by Thread:
click to view message preview
Re: Solutions and Discussion for Perl Quiz of the Week #17
Ron Isaacson wrote:> Pat C. wrote:> >> >
Ron Isaacson wrote:> >> > > BUT... you need to know in
advance how long the shortest word in the> > > file is. So it fails
the original intention, which is to pick a random> > > word
efficiently given an arbitrary input file.> >> > Well, if
you don't know how short the shortest word is, then you canjust> >
assume it is one character and you won't go wrong. If you do know more>
> about your dictionary file though, you can make it faster by
specifyingthe> > minimum length.>> Actually, I don't
think that's true. IIRC, your algorithm relies on> the minimum length to
know if it landed in the middle of a word. If> $MIN_WORD_LEN is smaller
than the actual minimum word length in the> file, it might return a
partial word.No, no, it relies on the minimum word length to know how
early in a line itcan seek without creating a bias towards longer
words. If the shortest wordin the file is 4 characters, then it can
accept a seek as far as 4characters from the end of the word without
creating a bias. Once it hasfound an acceptable seek point, it returns
the next line. I don't thinkthere's any chance of returning a partial
word. Having $MIN_WORD_LENsmaller than the shortest word should do no
harm other than running slightlyslower.-Pat C.
Next Message by Thread:
click to view message preview
Wrap-up for word analyzer and guesser
I won't have a chance to work any more on the analyzer (patan) for the
next day or so, and the list is moving on, so I thought I'd just wrap-up
by jotting down some final thoughts for anyone who cares.
After posting my last version of the pattern analyzer (well, sequence
analyzer as it really just looks at sequences which is only a small
subset of patterns), I reallized that it might increase accuracy further
to localize the sequences within words. My thinking, and it needs
measurements to confirm, is that certain sequences may be more likely to
appear at the beginning, middle, or end of a word. To put it in terms of
my chosen algorithm: the sequence "io", if at the end of a word, may
indicate that a guess of "n" is most likely to succeed, whereas if the
sequence "io" appeared at the beginning of a word some other letter
/might/ be more likely.
It's unclear without measurements how much of an improvement this would
yield or how many divisions would yield the best improvement.
The above change would likely neccessitate a change in data storage.
Storable is convenient, but it is never the ideal storage mechanism as
it is too general to be as compact as it could be if written for
specific data. A dbm or custom data format are the possible alternatives.
As an aside, I was attempting to use Data::Dumper to look at portions of
the data I was gathering, but hit a wall... Out of Memory! It looks like
Data::Dumper would have a method that took a file handle and printed as
it went instead of returning a string of the entire structure at once.
Patching Data::Dumper in this way would make an excellent expert challenge.
Anyway, here is what I had intended for the word guessing algorithm:
1. Guess a letter based on frequency:
A. If we do not know the length of the word, we guess using simple
letter frequencies derived from the entire dictionary (eg. `patan --dict
mydict --query -` returns this letter frequency table).
B. If we do know the length of the word, read the wordlist, recording
all words of that length. Then derive a simple letter frequency table
from the filtered wordlist.
2. If the guess is wrong, record it so we can filter the wordlist and
regenerate frequencies. (@filtered_wordlist = grep { $_ =~
/^[^badchars]+$/ } @wordlist) Return to step 1 until a correct guess is
made.
3. Once we have at least one correct letter we can switch to analyzing
sequences. For each correct letter or sequence of letters, look at the
most probably letter to follow or preceed that letter or (localized)
sequence. Compare probabilities for letters before or after /each/
sequence, not forgetting cases where a given letter is, say, 2nd most
probable in coming before one sequence and, say, 3rd most probable in
coming after a different sequence, making the given letter more probable
in total than any other letter. Repeat 'till done.
Randy.
|
|