Mark Jason Dominus wrote:
Rod Adams <rod-3IKiZD1jCB7k1uMJSBkQmQ@xxxxxxxxxxxxxxxx>:
A reasonable compromise might be as follows:
#pseudo code
seek(rand)
$n = rand(1..100)
for 0..$n
skip line
seek 0 if eof
$result = read line
Hey, that's a neat idea. I wanted it to work, but to my dismay it
turns out that it doesn't work all that well. With the small
dictionary (http://perl.plover.com/qotw/words/words.bz2) the words get
picked with relative probabilities between 608 ("abc") and 1155
("indignation"), which means that the most commonly picked word gets
picked almost twice as often as the least commonly picked word.
Still, for Hangman that's probably good enough. (Another downside is
that the frequent words tend to appear in clusters, as do the
infrequent words. The five most-commonly chosen words are
"indigestion", "indignant", "indignation", "indignity", and "indigo",
and the five least-commonly chosen are "abbe", "abey", "abbot",
"Abbott", and "abc".)
In general, I think I can live with a <200% difference. If you need
something stronger, you should probably either read the whole file or
build a separate index file.
But it seems to me that by arranging the words in the dictionary in a
suitable order, the bias could be reduced almost to zero. You would
want to avoid large clumps of long or short words. I can't think
offhand of any way of achieving this, but it seems that it should be
doable. Perhaps that would be a good topic for a future quiz.
Let's see....
I wrote a pair of scripts. One to test how well a given data file is
balanced, with various window sizes, and one to re-balance a data file.
Results:
D:\Code>test.pl words
1 Dev: 28.77% Min: 24.33% Max: 279.82%
2 Dev: 23.66% Min: 30.42% Max: 273.74%
4 Dev: 19.77% Min: 39.54% Max: 240.28%
8 Dev: 16.79% Min: 44.10% Max: 209.87%
16 Dev: 14.51% Min: 51.71% Max: 175.65%
32 Dev: 12.61% Min: 64.63% Max: 167.67%
64 Dev: 11.05% Min: 73.19% Max: 148.66%
128 Dev: 9.82% Min: 76.89% Max: 141.05%
256 Dev: 8.66% Min: 78.94% Max: 133.35%
D:\Code>test.pl words2
1 Dev: 28.77% Min: 24.33% Max: 279.82%
2 Dev: 21.06% Min: 36.50% Max: 218.99%
4 Dev: 13.73% Min: 51.71% Max: 167.29%
8 Dev: 7.67% Min: 66.91% Max: 138.39%
16 Dev: 4.23% Min: 84.40% Max: 117.86%
32 Dev: 2.17% Min: 90.87% Max: 109.50%
64 Dev: 1.13% Min: 94.86% Max: 105.31%
128 Dev: 0.58% Min: 97.52% Max: 102.75%
256 Dev: 0.30% Min: 98.80% Max: 101.46%
"words" is your small words file, "words2" is a re-balanced version of
the same.
First number is the size of the "look after random seek" window, Dev is
std dev of frequency, Min/Max is frequency relative to average frequency.
As you can see, my balancing algorithm isn't perfect, but it certainly
helps things out a great deal.
Scripts attached.
Comments welcome.
-- Rod Adams.
open IN, $ARGV[0] or die;
open OUT, '>'.$ARGV[1] or die;
$sum = $count = 0;
while (<IN>) {
push @words, $_;
push @lengths, length;
}
close IN;
Balance(0,$#words);
print OUT @words;
close OUT;
sub Balance {
my ($start, $stop) = @_;
return if ($stop - $start) < 3;
my $mid = int(($stop - $start+1)/2-1);
my $sum1 = 0;
for (my $x = $start ; $x <= $start + $mid ; ++$x) {
$sum1 += $lengths[$x];
}
my $sum2 = 0;
for (my $x = $stop ; $x >= $stop - $mid ; --$x) {
$sum2 += $lengths[$x];
}
my $target = int(($sum1 + $sum2) / 2);
my $avg = $target/$mid;
my ($w1, $w2) = ($start, $stop);
while ($sum1 != $target) {
if ($sum1 < $target) {
++$w1 while ($lengths[$w1] >= $avg && $w1 <= $start + $mid);
--$w2 while ($lengths[$w2] <= $avg && $w2 >= $stop - $mid);
} else {
++$w1 while ($lengths[$w1] <= $avg && $w1 <= $start + $mid);
--$w2 while ($lengths[$w2] >= $avg && $w2 >= $stop - $mid);
}
last if $w1 > $start + $mid;
last if $w2 < $stop - $mid;
$sum1 += $lengths[$w2] - $lengths[$w1];
$sum2 += $lengths[$w1] - $lengths[$w2];
@lengths[$w1, $w2] = @lengths[$w2, $w1];
@words[$w1, $w2] = @words[$w2, $w1];
++$w1; last if $w1 > $start + $mid;
--$w2; last if $w2 < $stop - $mid;
}
Balance($start, $start+$mid);
Balance($stop-$mid, $stop);
}
__END__
# get length of lines into an array
while (<>) {
push @length, length;
}
$SEEK_LENGTH = 1;
while ($SEEK_LENGTH <= 256) {
# compute total length of prior $SEEK_LENGTH words, and do some stats work
$sum_norm = 0;
$sum_square = 0;
$min = 1e1000;
$max = -1e1000;
for $x (0..$#length) {
$tmp = 0;
for $y (1..$SEEK_LENGTH) {
$tmp += $length[($x-$y) % @length];
}
$sum_norm += $tmp;
$sum_square += $tmp**2;
$min = $tmp if $tmp < $min;
$max = $tmp if $tmp > $max;
}
$avg = $sum_norm / @length;
$avgsqr = $sum_square / @length;
printf("%5d Dev: %5.2f%% Min: %5.2f%% Max: %5.2f%%\n",
$SEEK_LENGTH, ($avgsqr-$avg**2)**0.5/$avg*100,
$min/$avg*100, $max/$avg*100);
$SEEK_LENGTH *= 2;
}
__END__
|