osdir.com
mailing list archive

Subject: [ tidy-Bugs-1161797 ] --word-2000 always outputs numeric entities - msg#00084

List: web.html-tidy.tracker

Bugs item #1161797, was opened at 2005-03-12 04:04
Message generated for change (Comment added) made by hoehrmann
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390963&aid=1161797&group_id=27659

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Harriet Bazley (harriet)
Assigned to: Nobody/Anonymous (nobody)
Summary: --word-2000 always outputs numeric entities

Initial Comment:
Version reports itself as "HTML Tidy for RISC OS released on 1st December 2004"
- no discernible version number....


The --word-2000 option seems to override the --numeric-entities option; even
with an explicit "--numeric-entities no" in the command line, characters
with the top bit set (specifically, the Windows 'smart' quotes present in just
about every Microsoft Word document, which look dreadful in a non-Windows
character set) are translated as &#8230; etc., rather than the relevant named
entities.

This means I can *either* strip out the Word-generated rubbish *or* use named
entities, but not both :-(
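
For reference, the numeric references in question can be reproduced from the Windows-1252 bytes that Word typically writes. The following is only an illustrative Python sketch: the helper name describe_cp1252_byte is made up for this example, and it assumes the input really is Windows-1252. It shows that byte 0x85 decodes to U+2026, which is where &#8230; comes from, and that HTML 4 defines a named entity for it.

```python
from html.entities import codepoint2name


def describe_cp1252_byte(byte_value: int) -> tuple[str, str]:
    """Return (numeric reference, named reference) for one Windows-1252 byte.

    Hypothetical helper for illustration only; it is not part of Tidy.
    """
    # Decode the single byte as Windows-1252 to get the Unicode character.
    char = bytes([byte_value]).decode("cp1252")
    codepoint = ord(char)
    # codepoint2name maps Unicode code points to HTML 4 entity names.
    name = codepoint2name[codepoint]
    return f"&#{codepoint};", f"&{name};"


if __name__ == "__main__":
    # 0x93/0x94 are the double 'smart' quotes, 0x85 is the ellipsis.
    for byte_value in (0x93, 0x94, 0x85):
        numeric, named = describe_cp1252_byte(byte_value)
        print(f"0x{byte_value:02X} -> {numeric} or {named}")
```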


----------------------------------------------------------------------

>Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-03-13 04:17

Message:
Logged In: YES
user_id=188003

Tidy does not output a document type declaration by default
if there are proprietary elements and/or attributes in the
document (and thus no document type declaration could be
applicable to the document), so the absence of a doctype,
and therefore of named entity references, is by design.
Whether --doctype loose should behave differently I do not
really know...

----------------------------------------------------------------------

Comment By: Harriet Bazley (harriet)
Date: 2005-03-13 03:40

Message:
Logged In: YES
user_id=208570

If I specify an explicit "--doctype loose" Tidy does output a document type
declaration. However, it still uses numbered entities (see attached output).

----------------------------------------------------------------------

Comment By: Harriet Bazley (harriet)
Date: 2005-03-12 21:11

Message:
Logged In: YES
user_id=208570

In the nature of things such documents tend to be enormous; however, I've
snipped one down to a single representative paragraph (plus screeds of
MS-specific header) and attached it.

The output (from '*Tidy --word-2000 yes --numeric-entities no test/html')
*doesn't* include a document type declaration, despite the fact that the first
warning generated is "missing <!DOCTYPE> declaration". I've tried specifying
a '--doctype auto' parameter, but this doesn't have any effect.

I'm surprised, since in my previous experience Tidy *does* insert a doctype
where this is missing....

----------------------------------------------------------------------

Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-03-12 05:56

Message:
Logged In: YES
user_id=188003

It would help if you attach a simple test case. HTML Tidy
will only output named entity references if it outputs a
document type declaration, as you'd otherwise get
references to undefined entities which would confuse both
XML and SGML processors. So, unless the output includes a
document type declaration this is not a bug.
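
Where named references are needed regardless, one possible workaround is to post-process Tidy's output. The sketch below is an assumption-laden illustration, not something Tidy provides: the function name numeric_to_named is hypothetical, and the substitution is purely textual, so it will also touch references inside script or CDATA content. It rewrites numeric character references to the HTML 4 entity names known to Python's html.entities and leaves anything without a defined name unchanged. The result is only valid if the document carries a doctype that defines those entities.

```python
import re
from html.entities import codepoint2name

# Matches decimal (&#8230;) and hexadecimal (&#x2026;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def numeric_to_named(markup: str) -> str:
    """Replace numeric character references with HTML 4 named entities.

    Hypothetical post-processing helper; references with no HTML 4
    entity name are left exactly as they were.
    """

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        name = codepoint2name.get(codepoint)
        return f"&{name};" if name else match.group(0)

    return _NUMERIC_REF.sub(replace, markup)


if __name__ == "__main__":
    sample = "<p>&#8220;Quoted&#8221; text&#8230;</p>"
    print(numeric_to_named(sample))
```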

----------------------------------------------------------------------


