Subject: [ tidy-Bugs-1636028 ] Tidy API function tidyNodeGetText escapes output - msg#00044

List: web.html-tidy.tracker

Bugs item #1636028, was opened at 2007-01-15 17:27
Message generated for change (Settings changed) made by hoehrmann
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390963&aid=1636028&group_id=27659

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: TidyLib APIs
Group: Current - all platforms
Status: Open
Resolution: None
Priority: 6
Private: No
Submitted By: ayermakov (ayermakov)
>Assigned to: Nobody/Anonymous (nobody)
Summary: Tidy API function tidyNodeGetText escapes output

Initial Comment:
We use Tidy mainly through its API: we walk the parse tree, extract information
about each node, and transform it into our own tree structure. The issue occurs
when we call tidyNodeGetText() on nodes of type TidyNode_Text,
TidyNode_Comment, or TidyNode_CDATA.

Consider the input:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<TITLE> New Document </TITLE>
<SCRIPT type="text/javascript">document.write("<STYLE>");</SCRIPT>
</HEAD>

<BODY>
Hello!
</BODY>
</HTML>

To keep things simple, I walk the Tidy parse tree and print the text of each
node. Here is a small function that walks the tree:

#include <stdio.h>
#include <tidy.h>
#include <buffio.h>   /* named tidybuffio.h in HTML Tidy 5 and later */

/* Serialize one node with tidyNodeGetText() and print it with a label. */
static void printNodeText(TidyDoc tdoc, TidyNode tnod, const char *label)
{
    TidyBuffer text = {0};

    tidyBufInit(&text);
    if (tidyNodeGetText(tdoc, tnod, &text) && text.bp)
        printf("%s: %s\n", label, (const char *)text.bp);
    /* Release the buffer whether or not the call succeeded. */
    tidyBufFree(&text);
}

static void processNode(TidyDoc tdoc, TidyNode tnod)
{
    TidyNode child;

    for (child = tidyGetChild(tnod); child; child = tidyGetNext(child))
    {
        switch (tidyNodeGetType(child))
        {
        case TidyNode_Comment:
            printNodeText(tdoc, child, "TidyNode_Comment");
            break;
        case TidyNode_Text:
            printNodeText(tdoc, child, "TidyNode_Text");
            break;
        case TidyNode_CDATA:
            printNodeText(tdoc, child, "TidyNode_CDATA");
            break;
        case TidyNode_Start:
            /* Recurse into the element's children. */
            processNode(tdoc, child);
            break;
        default:
            break;
        }
    }
}

I call the function as follows:

processNode(tdoc, tidyGetRoot(tdoc));


The output of the function is:

TidyNode_Text: New Document
TidyNode_Text: document.write("&lt;STYLE&gt;");
TidyNode_Text: Hello!

Note that the angle brackets have been converted to SGML entities. It's not
clear why; this looks like a bug to me.

Even if there is a reason for this, we would like to get the original,
undistorted text without escaping. Is there an option to do that?
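
Until such an option exists, one possible workaround is to undo the escaping after the call. The following is only a minimal sketch: unescape_basic_entities is a hypothetical helper, not part of the Tidy API, and it assumes that only the entities &lt;, &gt;, &amp; and &quot; appear in the serialized text. Depending on configuration, Tidy may emit other named or numeric references that this helper does not handle, and the result is the decoded text rather than a byte-for-byte copy of the source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Tidy API.
   Decodes &lt; &gt; &amp; &quot; in place in a single pass and
   returns the length of the decoded string. */
static size_t unescape_basic_entities(char *s)
{
    static const struct { const char *name; char ch; } map[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' }, { "&quot;", '"' }
    };
    char *out = s;
    const char *in = s;

    assert(s != NULL);
    while (*in != '\0') {
        size_t i;
        int matched = 0;

        for (i = 0; i < sizeof map / sizeof map[0]; ++i) {
            size_t len = strlen(map[i].name);
            if (strncmp(in, map[i].name, len) == 0) {
                *out++ = map[i].ch;   /* write the decoded character */
                in += len;            /* skip the whole entity */
                matched = 1;
                break;
            }
        }
        if (!matched)
            *out++ = *in++;           /* copy ordinary characters unchanged */
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

Because the decoded text is never longer than the escaped text, the helper can safely work in place on a NUL-terminated copy of the buffer contents. The single pass matters: "&amp;lt;" decodes to "&lt;" and is not decoded a second time.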

----------------------------------------------------------------------

Comment By: Björn Höhrmann (hoehrmann)
Date: 2007-01-17 00:02

Message:
Logged In: YES
user_id=188003
Originator: NO

tidyNodeGetText is supposed to partially serialize a node; it is not meant
to get an element's text content. I think it would be incorrect for it not
to escape special characters in text. We lack a function to get the
content of text nodes; see the Jan 2003 thread "How to access lexbuf?" on
tidy-develop. I think adding such a function would address the
requestor's problem.

If tidyNodeGetText is to be improved independently of the addition of such
a function, it should continue to escape normal text nodes. Whether CDATA
element content, such as that of <script> and <style>, is escaped should
depend on whether XML/XHTML output is requested, and the content of
comments and PIs should probably never be escaped as if it were text.
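
The rule proposed in this comment can be sketched as a small decision function. This is an illustration only: NodeKind and should_escape are hypothetical names, not Tidy's TidyNodeType or any existing Tidy function, and the direction of the XML/XHTML case (escape when XML output is requested) is one reading of the comment, based on <script> and <style> content being parsed as ordinary character data in XHTML.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical node kinds for illustration; not Tidy's TidyNodeType. */
typedef enum {
    KIND_TEXT,       /* ordinary text node */
    KIND_RAW_TEXT,   /* content of <script> or <style> */
    KIND_COMMENT,    /* comment content */
    KIND_PROC_INS    /* processing instruction content */
} NodeKind;

/* Returns true if the node's content should be entity-escaped
   when it is serialized. */
static bool should_escape(NodeKind kind, bool xml_output)
{
    switch (kind) {
    case KIND_TEXT:
        return true;          /* normal text is always escaped */
    case KIND_RAW_TEXT:
        return xml_output;    /* assumed: escape only for XML/XHTML output */
    case KIND_COMMENT:
    case KIND_PROC_INS:
        return false;         /* never escaped as if it were text */
    }
    assert(!"unknown NodeKind");
    return true;
}
```

Under this sketch, the script content from the report would come back unescaped for HTML output, while the ordinary text nodes would still be escaped.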

----------------------------------------------------------------------

Comment By: Arnaud Desitter (arnaud02)
Date: 2007-01-16 23:35

Message:
Logged In: YES
user_id=566665
Originator: NO

Bjoern,
Could you comment on this patch?
Thanks,

----------------------------------------------------------------------

Comment By: ayermakov (ayermakov)
Date: 2007-01-16 14:58

Message:
Logged In: YES
user_id=1688233
Originator: YES

Well, it seems that bug entry 1166491 describes the same issue. However, I
have a slightly different opinion on how it should be resolved. I believe
that tidyNodeGetText should output the text of any node as-is, without any
processing (escaping). That holds not only for script and style nodes but
also for regular HTML text.
At the very least, it should be controlled by an option.
File Added: 1636028.diff

----------------------------------------------------------------------

Comment By: Arnaud Desitter (arnaud02)
Date: 2007-01-16 09:44

Message:
Logged In: YES
user_id=566665
Originator: NO

See http://tidy.sf.net/issue/1166491 which contains a patch that may be
correct.
It would be nice if you could provide a patch with a rationale so this
issue could be nailed down.

----------------------------------------------------------------------

_______________________________________________
Tidy-tracker mailing list
Tidy-tracker@xxxxxxxxxxxxxxxxxxxxx
https://lists.sourceforge.net/lists/listinfo/tidy-tracker