Bugs item #1636028, was opened at 2007-01-15 16:27
Message generated for change (Comment added) made by jpcs
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390963&aid=1636028&group_id=27659
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: TidyLib APIs
Group: Current - all platforms
Status: Open
Resolution: None
Priority: 6
Private: No
Submitted By: ayermakov (ayermakov)
Assigned to: Nobody/Anonymous (nobody)
Summary: Tidy API function tidyNodeGetText escapes output
Initial Comment:
We use tidy mainly through its API: we walk the parse tree, extract information
about each node, and transform it into our own tree structure. The
issue occurs when we call tidyNodeGetText() on nodes of type TidyNode_Text,
TidyNode_Comment and TidyNode_CDATA.
Consider the input:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<TITLE> New Document </TITLE>
<SCRIPT type="text/javascript">document.write("<STYLE>");</SCRIPT>
</HEAD>
<BODY>
Hello!
</BODY>
</HTML>
I walk the tidy parse tree and print the text of each node (for
simplicity). Here is a small function that walks the tree:
static void processNode(TidyDoc tdoc, TidyNode tnod)
{
    TidyNode child;
    for ( child = tidyGetChild(tnod); child; child = tidyGetNext(child) )
    {
        ctmbstr name = tidyNodeGetName( child );  /* not used below */
        TidyNodeType type = tidyNodeGetType(child);
        switch ( type )
        {
        case TidyNode_Comment:
            {
                TidyBuffer text = {0};
                tidyBufInit(&text);
                if (tidyNodeGetText(tdoc, child, &text))
                {
                    printf("TidyNode_Comment: %s\n",
                           text.bp);
                }
                /* free the buffer whether or not the call succeeded */
                tidyBufFree(&text);
            }
            break;
        case TidyNode_Text:
            {
                TidyBuffer text = {0};
                tidyBufInit(&text);
                if (tidyNodeGetText(tdoc, child, &text))
                {
                    printf("TidyNode_Text: %s\n", text.bp);
                }
                tidyBufFree(&text);
            }
            break;
        case TidyNode_CDATA:
            {
                TidyBuffer text = {0};
                tidyBufInit(&text);
                if (tidyNodeGetText(tdoc, child, &text))
                {
                    printf("TidyNode_CDATA: %s\n", text.bp);
                }
                tidyBufFree(&text);
            }
            break;
        case TidyNode_Start:
            {
                /* element node: recurse into its children */
                processNode( tdoc, child);
            }
            break;
        default:
            break;
        }
    }
}
I call the function as follows:
processNode(tdoc, tidyGetRoot(tdoc));
The output of the function is:
TidyNode_Text: New Document
TidyNode_Text: document.write("&lt;STYLE&gt;");
TidyNode_Text: Hello!
Note that the angle brackets are converted to SGML entities. It's not clear why;
this seems like a bug to me.
Even if there is some reason behind this, we would like to get the original
text, undistorted and without escaping. Is there any option to do that?
----------------------------------------------------------------------
Comment By: John Snelson (jpcs)
Date: 2008-01-22 20:58
Message:
Logged In: YES
user_id=1041934
Originator: NO
I have submitted a patch that addresses this bug here:
http://sourceforge.net/tracker/index.php?func=detail&aid=1877642&group_id=27659&atid=390965
My chosen method to fix this has been to introduce a new method,
tidyNodeGetValue(), that does not serialize the node.
----------------------------------------------------------------------
Comment By: Björn Höhrmann (hoehrmann)
Date: 2007-01-16 23:02
Message:
Logged In: YES
user_id=188003
Originator: NO
tidyNodeGetText is supposed to partially serialize a node; it is not meant
to get an element's text content. I think it would be incorrect for it not
to escape special characters in text. We lack a function to get the
content of text nodes; see the Jan 2003 thread "How to access lexbuf?" on
tidy-develop. I think the addition of such a function would address the
requestor's problem.
If tidyNodeGetText is to be improved independently of the addition of such a
function, it should continue to escape normal text nodes. Whether CDATA
element content, such as that of <script> and <style>, is escaped should
depend on whether XML/XHTML output is requested, and the content of comments
and PIs should probably never be escaped as if it were text.
----------------------------------------------------------------------
Comment By: Arnaud Desitter (arnaud02)
Date: 2007-01-16 22:35
Message:
Logged In: YES
user_id=566665
Originator: NO
Bjoern,
Could you comment on this patch ?
Thanks,
----------------------------------------------------------------------
Comment By: ayermakov (ayermakov)
Date: 2007-01-16 13:58
Message:
Logged In: YES
user_id=1688233
Originator: YES
Well, it seems that entry 1166491 describes the same issue. However, I
have a slightly different opinion on how it should be resolved. I believe that
tidyNodeGetText should output the text of any node 'as-is', without any
processing (escaping). This applies not only to script and style nodes,
but also to regular HTML text.
Or at least it should be controlled by some option.
File Added: 1636028.diff
----------------------------------------------------------------------
Comment By: Arnaud Desitter (arnaud02)
Date: 2007-01-16 08:44
Message:
Logged In: YES
user_id=566665
Originator: NO
See
http://tidy.sf.net/issue/1166491 which contains a patch that may be
correct.
It would be nice if you could provide a patch with a rationale so this
issue could be nailed down.
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390963&aid=1636028&group_id=27659
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Tidy-tracker mailing list
Tidy-tracker@xxxxxxxxxxxxxxxxxxxxx
https://lists.sourceforge.net/lists/listinfo/tidy-tracker
Previous Message by Date:
[ tidy-Patches-1877642 ] Patch to add tidyNodeGetValue() to the TidyLib API
Patches item #1877642, was opened at 2008-01-22 20:55
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390965&aid=1877642&group_id=27659
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Snelson (jpcs)
Assigned to: Nobody/Anonymous (nobody)
Summary: Patch to add tidyNodeGetValue() to the TidyLib API
Initial Comment:
This patch introduces a new function to the TidyLib API named
tidyNodeGetValue(), as discussed here:
http://lists.w3.org/Archives/Public/html-tidy/2008JanMar/0011.html
The function fills a TidyBuffer with the UTF-8 value of any node except an
element. This is very useful for manipulating the in-memory TidyDoc without
serializing it. This problem is also discussed in this bug:
http://sourceforge.net/tracker/index.php?func=detail&aid=1636028&group_id=27659&atid=390963
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390965&aid=1877642&group_id=27659
Next Message by Date:
[ tidy-Patches-1166491 ] Bugfix for printing script/style
Patches item #1166491, was opened at 2005-03-19 13:18
Message generated for change (Comment added) made by arnaud02
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390965&aid=1166491&group_id=27659
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Pending
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Detlev Vendt (detlevv)
Assigned to: Nobody/Anonymous (nobody)
Summary: Bugfix for printing script/style
Initial Comment:
The patch is based on the latest source (as of March 17th,
2005) and solves two problems:
- behaviour regarding <table>/<form> re-arrangement: in
certain situations Tidy tends to move tags in an erroneous
way.
- behaviour regarding printing the contents of
<style>/<script> tags: the content got mangled.
I have carried this patch with me since 2003. The changes
work well with a lot of pages found in the wild.
detlevv
----------------------------------------------------------------------
>Comment By: Arnaud Desitter (arnaud02)
Date: 2008-01-27 10:36
Message:
Logged In: YES
user_id=566665
Originator: NO
See http://tidy.sf.net/issue/1877642
----------------------------------------------------------------------
Comment By: Detlev Vendt (detlevv)
Date: 2005-03-21 19:31
Message:
Logged In: YES
user_id=725273
I know this validation problem, but this isn't the case here.
The problem is, as Björn Höhrmann said, that
tidyNodeGetText() uses the wrong mode; please refer to my
patch.
detlevv
----------------------------------------------------------------------
Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-03-21 19:28
Message:
Logged In: YES
user_id=188003
Note (again) that in HTML
<script><!----></script>
there is no comment at all; see
http://esw.w3.org/topic/ValidationProblems
----------------------------------------------------------------------
Comment By: Detlev Vendt (detlevv)
Date: 2005-03-21 19:24
Message:
Logged In: YES
user_id=725273
Erratum: there _is_ a parameter 'hide-comments', but within
style/script (as said) nothing should be touched (neither
deleted nor changed).
----------------------------------------------------------------------
Comment By: Detlev Vendt (detlevv)
Date: 2005-03-21 19:21
Message:
Logged In: YES
user_id=725273
We should not make it more complicated than it is... Some facts:
- tidy should never ever change the content of a <script> or
<style> block (internally, this content is handled as CDATA)
- without my patch, tidy changes every '<' and '>' into '&lt;' and '&gt;'
respectively within a script or style block.
This has nothing to do with XHTML (the example was taken from
a simple HTML page w/o doctype). Also there is no --hide-comments
parameter (see my option settings, all included...).
At least within a script block, comments as shown are valid
and commonly used.
detlevv
----------------------------------------------------------------------
Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-03-21 19:20
Message:
Logged In: YES
user_id=188003
tidyNodeGetText(...) uses mode = NORMAL when calling
PPrintTree, which is not the correct mode for various
elements (or nodes, in fact); it would need to use mode =
CDATA for CDATA elements like script and style, etc. You
can't reproduce this using the command line tool as it does
not use tidyNodeGetText(...). I am not sure how to fix this,
though; maybe we should add a new function that allows
setting the initial mode and make tidyNodeGetText a wrapper
for that function.
----------------------------------------------------------------------
Comment By: Arnaud Desitter (arnaud02)
Date: 2005-03-21 17:54
Message:
Logged In: YES
user_id=566665
Sorry to be dense.
Considering:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
<style type="text/css"><!-- testing... -->
a { text-decoration:none; } <!-- another comment -->
--></style>
</head>
<body>
</body>
</html>
what is the problem?
Can somebody explain what the expected behaviour is and
what minimal set of tidy options is necessary to
reproduce this problem? Does it have to do with the content
model of script/style in XHTML? Does it have to do with
"--hide-comments yes"? Any clear explanation?
----------------------------------------------------------------------
Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-03-21 16:44
Message:
Logged In: YES
user_id=188003
Well, the two code fragments are only ever equivalent if
the input fragment is HTML and the resulting fragment is
XHTML; otherwise the script element has different content.
And Tidy does not implement this distinction: for an XHTML
input document with
<style type="text/css"><!--...--></style>
Tidy would consider the style element to contain the text
"<!--...-->" rather than a comment with the text "...".
This is essentially necessary to support incorrectly coded
XHTML. So if Tidy behaves as described, this is probably a
bug (probably, as this depends on the documentation for
tidyNodeGetText(...), which I did not check).
----------------------------------------------------------------------
Comment By: Arnaud Desitter (arnaud02)
Date: 2005-03-21 13:28
Message:
Logged In: YES
user_id=566665
Could you explain why the current behaviour is not correct?
Is there any reference to the HTML standard?
----------------------------------------------------------------------
Comment By: Detlev Vendt (detlevv)
Date: 2005-03-20 15:15
Message:
Logged In: YES
user_id=725273
I did not succeed in reproducing the first problem; it seems
to me that the problem has meanwhile been solved by another
change.
I've attached a changed patch containing only the solution for
the mangled comments within style/script.
detlevv
----------------------------------------------------------------------
Comment By: Detlev Vendt (detlevv)
Date: 2005-03-20 11:02
Message:
Logged In: YES
user_id=725273
Here's the sample code for the second case: comments are
converted to &lt; / &gt; within the style tag when using
tidyNodeGetText:
Origin:
<style type="text/css"><!-- testing... -->
a { text-decoration:none; } <!-- another comment -->
--></style>
Output of tidyNodeGetText():
<style type="text/css">&lt;!-- testing... --&gt;
a { text-decoration:none; } &lt;!-- another comment --&gt;
--&gt;</style>
Options set:
tidyOptSetInt (tdoc, TidyIndentSpaces, 0);
tidyOptSetInt (tdoc, TidyWrapLen, 9999);
tidyOptSetBool (tdoc, TidyHideComments, yes);
tidyOptSetBool (tdoc, TidyForceOutput, yes);
tidyOptSetBool (tdoc, TidyQuoteAmpersand, no);
tidyOptSetBool (tdoc, TidyMark, no);
tidyOptSetBool (tdoc, TidyNumEntities, no);
----------------------------------------------------------------------
Comment By: Björn Höhrmann (hoehrmann)
Date: 2005-03-19 18:02
Message:
Logged In: YES
user_id=188003
Could you also attach test cases that demonstrate the
undesired behavior and how current tidy and your patch
would handle these cases?
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390965&aid=1166491&group_id=27659
Next Message by Thread:
[ tidy-Bugs-1636028 ] Tidy API function tidyNodeGetText escapes output
Bugs item #1636028, was opened at 2007-01-15 16:27
Message generated for change (Comment added) made by arnaud02
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=390963&aid=1636028&group_id=27659
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: TidyLib APIs
Group: Current - all platforms
>Status: Pending
>Resolution: Fixed
Priority: 6
Private: No
Submitted By: ayermakov (ayermakov)
Assigned to: Nobody/Anonymous (nobody)
Summary: Tidy API function tidyNodeGetText escapes output
----------------------------------------------------------------------
>Comment By: Arnaud Desitter (arnaud02)
Date: 2008-01-27 10:36
Message:
Logged In: YES
user_id=566665
Originator: NO
See http://tidy.sf.net/issue/1877642.
----------------------------------------------------------------------