Dear all,
A couple of weeks ago I raised an issue about Byte Order Marks effectively
disabling the header parsing functions. Thanks again to those who took the
trouble to reply. After a bit of poking around in the dark I've fixed it
for my own needs - I leave it to others to judge whether this is robust
enough for general usage, especially since I have only carried out cursory
testing.
Within the HTML head parser (HTML::HeadParser) there is a routine called text
that looks like this:
sub text
{
    my($self, $text) = @_;
    print "TEXT[$text]\n" if $DEBUG;
    my $tag = $self->{tag};
    if (!$tag && $text =~ /\S/) {
        # Normal text means start of body
        $self->eof;
        return;
    }
    return if $tag ne 'title';
    $self->{'text'} .= $text;
}
This is where the byte order mark is detected and the process stops since
it's text outside any tag. So I've just added an extra term to the if
statement thus:
sub text
{
    my($self, $text) = @_;
    print "TEXT[$text]\n" if $DEBUG;
    my $tag = $self->{tag};
    if (!$tag && $text =~ /\S/ && !BOM($text)) {
        # Normal text means start of body
        $self->eof;
        return;
    }
    return if $tag ne 'title';
    $self->{'text'} .= $text;
}
And defined a little routine thus:
sub BOM {
    my $text = shift;
    # Numeric values of the first four bytes (-1 where the text is shorter)
    my @top = unpack("C4", $text);
    push @top, -1 while @top < 4;
    my ($top1, $top2, $top3, $top4) = @top;
    # UTF-8 (EF BB BF)
    if ($top1 == 239 && $top2 == 187 && $top3 == 191) {
        return 'UTF-8';
    }
    # UTF-32 little endian (FF FE 00 00); this must be tested before
    # UTF-16 little endian, whose mark FF FE is a prefix of this one
    if ($top1 == 255 && $top2 == 254 && $top3 == 0 && $top4 == 0) {
        return 'UTF-32 little endian';
    }
    # UTF-32 big endian (00 00 FE FF)
    if ($top1 == 0 && $top2 == 0 && $top3 == 254 && $top4 == 255) {
        return 'UTF-32 big endian';
    }
    # UTF-16 little endian (FF FE)
    if ($top1 == 255 && $top2 == 254) {
        return 'UTF-16 little endian';
    }
    # UTF-16 big endian (FE FF)
    if ($top1 == 254 && $top2 == 255) {
        return 'UTF-16 big endian';
    }
    return 0;
}
This is an adaptation of a routine found at
http://dev.w3.org/cvsweb/p3p-validator/20001215/xml.pl?rev=1.5.
I have not been able to test this on any BOMs other than UTF-8. If you use a
BOM other than that, I'd be very pleased to hear of it.
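For anyone who wants to try the other marks without finding a live site that uses them, here is a minimal, self-contained sketch that feeds each BOM byte sequence to a detection routine. The name detect_bom and the table layout are my own for illustration and are not part of LWP or HTML::HeadParser. It assumes the text arrives as raw bytes rather than decoded characters.

```perl
use strict;
use warnings;

# Hypothetical table of BOM byte sequences. The four-byte UTF-32 marks
# come first so that FF FE 00 00 is tried before the UTF-16
# little-endian mark FF FE, which is a prefix of it.
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32 big endian'    ],
    [ "\xFF\xFE\x00\x00", 'UTF-32 little endian' ],
    [ "\xEF\xBB\xBF",     'UTF-8'                ],
    [ "\xFE\xFF",         'UTF-16 big endian'    ],
    [ "\xFF\xFE",         'UTF-16 little endian' ],
);

# Return the name of the BOM that $bytes starts with, or 0 if none.
sub detect_bom {
    my ($bytes) = @_;
    return 0 unless defined $bytes;
    for my $entry (@BOMS) {
        my ($mark, $name) = @$entry;
        return $name if substr($bytes, 0, length $mark) eq $mark;
    }
    return 0;
}

# Exercise every mark, each followed by a little ordinary content.
for my $entry (@BOMS) {
    my ($mark, $name) = @$entry;
    my $got = detect_bom($mark . "<html>");
    print "$name: ", ($got eq $name ? "ok" : "FAILED (got $got)"), "\n";
}
print "no BOM: ", (detect_bom("<html>") ? "FAILED" : "ok"), "\n";
```

One caveat: a UTF-16 little-endian document whose first character is U+0000 begins with the same four bytes as the UTF-32 little-endian mark, so the name returned in that case is only a best guess.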
The changes are in place on the ICRA label tester:
www.icra.org/label/tester/ (this looks for PICS labels in the headers, hence
the importance of this bit of LWP for me!). In the original e-mail for this
thread I gave the following two examples, both of which now work correctly:
An example of a site with a BOM that previously showed as having no headers
but is now OK:
http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.xtranslations.com&showHead=on&showContent=on
An example of a site with a label without a BOM (that still works as it
should!)
http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.yahoo.com&showHead=on&showContent=on
Phil Archer
Chief Technical Officer
Internet Content Rating Association
Label your site today at http://www.icra.org
----- Original Message -----
From: "Phil Archer" <phil.archer@xxxxxxxx>
To: "libwww list" <libwww@xxxxxxxx>
Sent: Thursday, October 07, 2004 10:11 AM
Subject: Byte Order Mark mucks up headers
Hi,
I've read Sean Burke's book, I've looked through the archives of this list
and done other searches but can't find an answer to a problem I have found
with LWP. If the character coding for a website has a byte order mark
(things like utf-16, all that "big endian/little endian" stuff) then LWP
can't interpret HTML headers in the usual way. Does anyone know a way
around this?
Background:
I work for an organisation called ICRA. We provide a self-labelling and
filtering system for the web, currently based on the old PICS standard but
soon to move to RDF. A couple of years ago I built a tool for our website
that visits a site, checks for PICS labels and parses them if found. Now,
I can strip out the BOM from the content where found and do other clunky
processing but that would mean I can't use LWP's efficient header
commands. For sites without a BOM I can just get header->('Pics-label')
and process that.
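The stripping approach mentioned above could look roughly like the following sketch. The name strip_bom and the sample page are hypothetical, and the content is assumed to be raw bytes as fetched, not decoded characters.

```perl
use strict;
use warnings;

# Remove a leading BOM from a string of raw bytes and return the rest.
# The four-byte UTF-32 marks come first in the alternation because
# FF FE 00 00 begins with the UTF-16 little-endian mark FF FE.
sub strip_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00|\xEF\xBB\xBF|\xFE\xFF|\xFF\xFE)//;
    return $bytes;
}

# Hypothetical page content that starts with a UTF-8 BOM.
my $page = "\xEF\xBB\xBF<html><head><title>Example</title></head></html>";
print strip_bom($page), "\n";
```

Removing the mark is only enough on its own for UTF-8. UTF-16 or UTF-32 content would still need decoding before the head could be parsed as ordinary text.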
You can see the label tester at www.icra.org/label/tester/
An example of a site with a BOM that shows as unlabelled even though it
is:
http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.xtranslations.com&showHead=on&showContent=on
An example of a site with a label without a BOM (i.e. one that works as it
should) would be
http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.yahoo.com&showHead=on&showContent=on
Any help gratefully accepted.
Phil.