Bugs item #1161797, was opened at 2005-03-12 04:04
Message generated for change (Comment added) made by hoehrmann
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390963&aid=1161797&group_id=27659
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Harriet Bazley (harriet)
Assigned to: Nobody/Anonymous (nobody)
Summary: --word-2000 always outputs numeric entities
Initial Comment:
Version reports itself as "HTML Tidy for RISC OS released on 1st December 2004"
- no discernible version number....
The --word-2000 option seems to override the --numeric-entities option; even
with an explicit "--numeric-entities no" on the command line, 8-bit characters
with the top bit set (specifically, the Windows 'smart' quotes present in just
about every Microsoft Word document, which look dreadful in a non-Windows
character set) are translated as … etc, rather than the relevant named
entities.
This means I can *either* strip out the Word-generated rubbish *or* use named
entities, but not both :-(
----------------------------------------------------------------------
>Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-03-13 04:17
Message:
Logged In: YES
user_id=188003
Tidy does not output a document type declaration by default
if there are proprietary elements and/or attributes in the
document (and thus no document type declaration could be
applicable to the document), so the absence of both a doctype
and named entity references is by design. Whether --doctype
loose should behave differently I really do not know...
----------------------------------------------------------------------
Comment By: Harriet Bazley (harriet)
Date: 2005-03-13 03:40
Message:
Logged In: YES
user_id=208570
If I specify an explicit "--doctype loose" Tidy does output a document type
declaration. However, it still uses numbered entities (see attached output).
----------------------------------------------------------------------
Comment By: Harriet Bazley (harriet)
Date: 2005-03-12 21:11
Message:
Logged In: YES
user_id=208570
In the nature of things such documents tend to be enormous; however, I've
snipped one down to a single representative paragraph (plus screeds of
MS-specific header) and attached it.
The output (from '*Tidy --word-2000 yes --numeric-entities no test/html')
*doesn't* include a document type declaration, despite the fact that the first
warning generated is "missing <!DOCTYPE> declaration". I've tried specifying
a '--doctype auto' parameter, but this doesn't have any effect.
I'm surprised, since in my previous experience Tidy *does* insert a doctype
where this is missing....
----------------------------------------------------------------------
Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-03-12 05:56
Message:
Logged In: YES
user_id=188003
It would help if you attached a simple test case. HTML Tidy
will only output named entity references if it outputs a
document type declaration, as you'd otherwise get
references to undefined entities which would confuse both
XML and SGML processors. So, unless the output includes a
document type declaration this is not a bug.
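The rule described above can be sketched in a few lines. This is only an
illustration of the stated behaviour, not Tidy's actual source: the function
name escape_non_ascii and the has_doctype flag are hypothetical, and the
entity table is Python's standard html.entities.codepoint2name (the HTML 4.01
entity set).

```python
from html.entities import codepoint2name


def escape_non_ascii(text, has_doctype):
    """Escape non-ASCII characters as named or numeric references.

    Assumption: named references are used only when a document type
    declaration (which defines them) will be written to the output.
    ASCII characters, including & and <, are passed through untouched.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif has_doctype and code in codepoint2name:
            # The entity is defined by the doctype, so a name is safe.
            out.append("&%s;" % codepoint2name[code])
        else:
            # No doctype (or no known name): fall back to a numeric reference.
            out.append("&#%d;" % code)
    return "".join(out)


# Windows-1252 bytes 0x93, 0x94 and 0x85 are the 'smart' quotes and ellipsis.
sample = b"\x93Hello\x85\x94".decode("cp1252")
print(escape_non_ascii(sample, has_doctype=True))
print(escape_non_ascii(sample, has_doctype=False))
```

With has_doctype set to False every top-bit-set character comes out as a
numeric reference, which matches the output reported in this item.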
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390963&aid=1161797&group_id=27659
-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
Next Message by Date:
[ tidy-Feature Requests-1162057 ] Javascript version?
Feature Requests item #1162057, was opened at 2005-03-12 18:52
Message generated for change (Comment added) made by hoehrmann
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390966&aid=1162057&group_id=27659
Category: Source portability
Group: None
Status: Open
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: Javascript version?
Initial Comment:
I recently discovered the XMLHttpRequest object in IE 5+
for generating HTTP requests to get XML documents from
other servers. I also discovered it allows getting HTML
documents, but the returned page is stored only as a
string. I am looking for a way to take the HTML string and
be able to "tidy" it up into valid XHTML so that I can then
create a DOM object from it.
I have so far been unsuccessful in finding something
that can take a string in JavaScript and do the tidying,
and thought I'd suggest, as a future request, a
JavaScript version of HTML Tidy so that
client-side programmers can convert strings of HTML on
the fly in the browser without using some server-side
process.
Thanks!
----------------------------------------------------------------------
>Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-03-13 04:22
Message:
Logged In: YES
user_id=188003
This would require that Tidy be available on the client
system; few users have Tidy installed and even fewer users
would allow web pages to execute it even if installed. So
you should either use a server-side process that allows you
to use Tidy, or you do indeed need a JavaScript version of
Tidy. That's a complex task and out of scope for this
project.
However, you might be able to use JTidy from a Java applet,
so I would suggest investigating that instead if using a
server-side process is really not an option.
Finally, the core functionality is easily portable to
browsers: you can typically create a new document, write the
HTML content to that document and traverse the resulting
tree, building a string that represents XHTML code. In that
process you can also easily strip out unwanted elements and
attributes and check for other errors. This might be a good
alternative, too.
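The parse, traverse and re-serialise idea in the last paragraph can be
sketched as follows. It is written in Python with the standard html.parser
module standing in for the browser's own parser, so it is an illustration of
the approach rather than browser code. The Node class, the VOID and UNWANTED
sets and the tidy_fragment name are all hypothetical, and the tree builder is
far less forgiving than a real browser.

```python
from html import escape
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag
UNWANTED = {"script", "style", "font"}  # hypothetical list of elements to strip


class Node:
    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = list(attrs)
        self.children = []  # Node instances or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def to_xhtml(node):
    """Walk the tree and build an XHTML string, dropping unwanted elements."""
    if isinstance(node, str):
        return escape(node, quote=False)
    if node.tag in UNWANTED:
        return ""
    inner = "".join(to_xhtml(child) for child in node.children)
    if node.tag == "#root":
        return inner
    attrs = "".join(
        # Valueless attributes such as "disabled" get their own name as value.
        ' %s="%s"' % (name, escape(value if value is not None else name))
        for name, value in node.attrs
    )
    if node.tag in VOID:
        return "<%s%s />" % (node.tag, attrs)
    return "<%s%s>%s</%s>" % (node.tag, attrs, inner, node.tag)


def tidy_fragment(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return to_xhtml(builder.root)


print(tidy_fragment("<P CLASS=intro>Fish &amp; chips<BR><script>x()</script><b>bold"))
```

Elements left open at the end of the input are closed implicitly because the
serialiser always writes an end tag for every non-void node in the tree.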
----------------------------------------------------------------------
Next Message by Thread:
[ tidy-Bugs-1161797 ] --word-2000 always outputs numeric entities
Bugs item #1161797, was opened at 2005-03-12 03:04
Message generated for change (Comment added) made by harriet
----------------------------------------------------------------------
>Comment By: Harriet Bazley (harriet)
Date: 2005-03-13 22:34
Message:
Logged In: YES
user_id=208570
I had a nasty suspicion the behaviour was by design; however, since I need to
use named entities for portability, this makes the translation rather less than
useful :-(
----------------------------------------------------------------------