
[ tidy-Bugs-1161797 ] --word-2000 always outputs numeric entities: msg#00063

web.html-tidy.tracker

Subject: [ tidy-Bugs-1161797 ] --word-2000 always outputs numeric entities

Bugs item #1161797, was opened at 2005-03-12 04:04
Message generated for change (Comment added) made by hoehrmann
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390963&aid=1161797&group_id=27659

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Harriet Bazley (harriet)
Assigned to: Nobody/Anonymous (nobody)
Summary: --word-2000 always outputs numeric entities

Initial Comment:
Version reports itself as "HTML Tidy for RISC OS released on 1st December 2004"
- no discernible version number....


The --word-2000 option seems to override the --numeric-entities option; even
with an explicit "--numeric-entities no" on the command line, characters
with the top bit set (specifically, the Windows 'smart' quotes present in just
about every Microsoft Word document, which look dreadful in a non-Windows
character set) are translated as &#8230; etc., rather than the relevant named
entities.

This means I can *either* strip out the Word-generated rubbish *or* use named
entities, but not both :-(
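For reference, the characters in question can be listed with a short Python sketch. It is not part of Tidy; it assumes the Word export uses the Windows-1252 code page and relies on the standard library's HTML 4.01 entity table. The helper name `describe` is made up for illustration.

```python
from html.entities import codepoint2name

# Windows-1252 bytes for the ellipsis and the four "smart" quotes
# (assumption: the Word document was saved in Windows-1252).
CP1252_BYTES = bytes([0x85, 0x91, 0x92, 0x93, 0x94])


def describe(byte_value: int) -> tuple[str, str]:
    """Return (numeric reference, named reference) for one Windows-1252 byte."""
    code_point = ord(bytes([byte_value]).decode("cp1252"))
    return f"&#{code_point};", f"&{codepoint2name[code_point]};"


for b in CP1252_BYTES:
    numeric, named = describe(b)
    print(f"0x{b:02X} -> {numeric} or {named}")
```

Byte 0x85, for example, is the ellipsis that shows up as &#8230; in the output and would be &hellip; as a named entity.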


----------------------------------------------------------------------

>Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-08-18 22:34

Message:
Logged In: YES
user_id=188003

Tidy actually uses numeric references here for portability.
In particular, if the output is XHTML there is no other
option: named entities are not allowed without also
having a DTD. The situation for HTML is basically the same,
but browsers care less. I'm not sure there is anything we
can do about this.
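One possible workaround, sketched here in Python and not a feature of Tidy, is to post-process Tidy's output and swap numeric references for named ones wherever HTML 4.01 defines a name. The function name `numeric_to_named` is hypothetical. The sketch assumes the result will be served as HTML with a suitable doctype, and it makes no attempt to skip script or style content.

```python
import re
from html.entities import codepoint2name

# Matches decimal (&#8230;) and hexadecimal (&#x2026;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def numeric_to_named(html: str) -> str:
    """Replace numeric character references with HTML 4.01 named entities.

    References with no HTML 4.01 name are left untouched.
    """
    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        name = codepoint2name.get(code_point)
        return f"&{name};" if name else match.group(0)

    return _NUMERIC_REF.sub(replace, html)


if __name__ == "__main__":
    print(numeric_to_named("<p>&#8220;Quoted&#8221; text&#8230;</p>"))
```

Because the replacement runs after Tidy has finished, it works with --word-2000 as well, at the cost of the portability concerns described above.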

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2005-04-09 23:17

Message:
Logged In: NO

simplify

----------------------------------------------------------------------

Comment By: Harriet Bazley (harriet)
Date: 2005-03-13 23:34

Message:
Logged In: YES
user_id=208570

I had a nasty suspicion the behaviour was by design; however, since I need to
use named entities for portability, this makes the translation rather less than
useful :-(

----------------------------------------------------------------------

Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-03-13 04:17

Message:
Logged In: YES
user_id=188003

Tidy does not output a document type declaration by default
if there are proprietary elements and/or attributes in the
document (and thus no document type declaration could be
applicable to the document), so no doctype and therefore no named
entity references is by design. Whether --doctype loose
should behave differently I do not really know...

----------------------------------------------------------------------

Comment By: Harriet Bazley (harriet)
Date: 2005-03-13 03:40

Message:
Logged In: YES
user_id=208570

If I specify an explicit "--doctype loose" Tidy does output a document type
declaration. However, it still uses numbered entities (see attached output).

----------------------------------------------------------------------

Comment By: Harriet Bazley (harriet)
Date: 2005-03-12 21:11

Message:
Logged In: YES
user_id=208570

In the nature of things such documents tend to be enormous; however, I've
snipped one down to a single representative paragraph (plus screeds of
MS-specific header) and attached it.

The output (from '*Tidy --word-2000 yes --numeric-entities no test/html')
*doesn't* include a document type declaration, despite the fact that the first
warning generated is "missing <!DOCTYPE> declaration". I've tried specifying
a '--doctype auto' parameter, but this doesn't have any effect.

I'm surprised, since in my previous experience Tidy *does* insert a doctype
where this is missing....

----------------------------------------------------------------------

Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-03-12 05:56

Message:
Logged In: YES
user_id=188003

It would help if you attached a simple test case. HTML Tidy
will only output named entity references if it outputs a
document type declaration, as you'd otherwise get
references to undefined entities which would confuse both
XML and SGML processors. So, unless the output includes a
document type declaration this is not a bug.
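The point about undefined entities can be seen with any XML parser. The following is a small sketch using Python's standard library parser rather than Tidy itself; the helper name `parses_as_xml` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment is accepted by a plain XML parser."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# A numeric character reference needs no DTD.
print(parses_as_xml("<p>&#8230;</p>"))   # True

# A named entity with no DTD declaring it is rejected as undefined.
print(parses_as_xml("<p>&hellip;</p>"))  # False
```

Only the five predefined XML entities (&amp;lt;, &amp;gt;, &amp;amp;, &amp;quot;, &amp;apos;) are available without a DTD, which is why numeric references are the portable choice.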

----------------------------------------------------------------------
