Subject: Segmentation fault using libtidy (symptoms,
diagnosis, and bush medicine cure) - msg#00018

List: web.html-tidy.devel


Hi,

I've been successfully using libtidy while developing a Ruby
application for some time, via the Tidy Ruby gem (a gem being, to all
intents and purposes, a packaged library). I develop on Mac OS X 10.4
on a 12" PowerBook.

When it came time to deploy on Linux (first on Debian 3.1, later on
CentOS) we ran into a segmentation fault in our application. After
some time I narrowed it down to our aggregator, which uses tidy to
clean up the HTML we receive via RSS and make it amenable to analysis.

After weeding out a number of other contenders I finally tracked the
problem down to our Tidy related code:

def self.sanitize( html )
  # convert to nice XHTML
  out = Tidy.open(:show_warnings => false,
                  :show_errors => 0,
                  :output_xhtml => true,
                  :show_body_only => true,
                  :alt_text => "Content Image",
                  :enclose_block_text => true,
                  :output_encoding => "utf8") do |tidy|
    tidy.clean( html )
  end

  # remove bad boy tags
  out = out.gsub( /<script.+?\/>/nm, "" )
  # non-greedy body, so text between two separate script blocks survives
  out = out.gsub( /<script.*?>.*?<\/script>/nm, "" )
  out = out.gsub( /<\/?font.*?\/?>/nm, "" )
  tag_links_with_class( out, 'content_link' )
end

In testing it appeared that the segmentation fault occurred within the
call to tidy.clean(). When I extracted this as a test case outside our
Rails application, however, it ran without causing the SIGSEGV.

Yet our use of tidy was isolated in this one spot and I confirmed that
the problem occurred on the first call to Tidy. I couldn't see how
Rails could possibly be affecting this issue.

A little spade-work with GDB turned up something very interesting:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1209771232 (LWP 26885)]
0x4e5aa833 in GetToken () from /usr/lib/libMagick.so.6

libMagick, it turns out, is part of ImageMagick. What the hell are we
doing in there? It certainly explains the SIGSEGV, since libMagick
will probably do nasty things to the tidy buffer.

It occurred to me that our Rails application uses RMagick, the Ruby
interface to ImageMagick, which loads libMagick.so.6. Somehow loading
it was affecting Tidy.

Since I had the CVS version of libtidy (to ensure that it had the
problem too) I modified the Makefile to turn on debugging and get a
proper backtrace:

#0  0x4e5aa833 in GetToken () from /usr/lib/libMagick.so.6
#1  0xb729dca1 in ParseHTML (doc=0x8855710, html=0x8536b08, mode=0) at parser.c:3635
#2  0xb729ea87 in ParseDocument (doc=0x8855710) at parser.c:4079
#3  0xb72bc868 in tidyDocParseStream (doc=0x8855710, in=0x84d9e10) at tidylib.c:1171
#4  0xb72bc084 in tidyDocParseString (doc=0x8855710,
    content=0x84d9c08 "<blockquote>\n <p>\"Benjamin Franklin was shown the new American constitution, and he said, 'I don't like it, but I will vote for it because we need something right now. But this constitution in time"...)
    at tidylib.c:881
#5  0xb72bbd66 in tidyParseString (tdoc=0x8855710,
    content=0x84d9c08 "<blockquote>\n <p>\"Benjamin Franklin was shown the new American constitution, and he said, 'I don't like it, but I will vote for it because we need something right now. But this constitution in time"...)
    at tidylib.c:803
#6  0xb72e36c9 in rb_dlsym_guardcall (type=73 'I', ret=0xbff701b8, stack=0xbff700a0, func=0xb72bbd0f) at sym.c:399

Everything looks normal up to the call to ParseHTML(). parser.c:3635 looks like:

node = GetToken( doc, ignoreWhitespace );

GetToken(): the same function name as in libMagick. lexer.c in libtidy
defines its own GetToken(), which is almost certainly the function
ParseHTML() expects to call.

When I modified my previously working test case to load RMagick it
began to fail. So the problem seems to be that once libMagick.so is
loaded, its GetToken() is found ahead of libtidy's own in the
process's symbol space, which breaks the libtidy library.
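
One way to check this kind of clash from inside the Ruby process is to
compare the address a symbol resolves to process-wide with the address
found in one specific library. The following is only a minimal sketch:
it uses Fiddle (the standard-library successor to the DL module seen
in the backtrace), and the helper names symbol_address and interposed?
are made up for illustration.

```ruby
require 'fiddle'

# Return the address `name` resolves to through `handle`,
# or nil when the symbol cannot be found.
def symbol_address(handle, name)
  handle[name]
rescue Fiddle::DLError
  nil
end

# Compare the process-wide resolution of `name` with the definition
# reachable through the library at `lib_path`.
# Returns true when they differ, false when they match, and nil when
# the library or the symbol could not be found (no conclusion possible).
def interposed?(lib_path, name)
  global = symbol_address(Fiddle::Handle::DEFAULT, name)
  local  = symbol_address(Fiddle::Handle.new(lib_path), name)
  return nil if global.nil? || local.nil?
  global != local
rescue Fiddle::DLError
  # Raised by Fiddle::Handle.new when the library cannot be loaded.
  nil
end
```

With RMagick already required, a call such as
interposed?('libtidy.so', 'GetToken') (the library name is an
assumption and depends on the system) returning true would suggest
that another GetToken() is found first. It does not by itself prove
which definition libtidy's internal calls bind to, since that also
depends on how the library was linked and loaded.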

It's been a long time since I did any C coding, and I'm not really
familiar with the Unix library loading rules, especially as they
pertain to dynamically loaded shared libraries, so I'm not sure
whether there is a rational explanation for this behaviour.

My current workaround is to rename GetToken() to GetTidyToken() in
lexer.h, lexer.c, and parser.c and compile my own version of the
library. In testing so far I've not experienced any further problems.
However, I am concerned that the same problem could easily be lurking
in other function names.
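
To look for other clashing names, one option is to intersect the
exported function names of the two libraries. The sketch below assumes
listings in the format printed by GNU nm -D --defined-only; the helper
names and the two sample listings are made up for illustration and are
not real output from libtidy or libMagick.

```ruby
require 'set'

# Parse `nm -D --defined-only` style output and return the names of
# symbols defined in the text (code) section, which nm marks with "T".
def exported_functions(nm_output)
  nm_output.each_line
           .map(&:split)
           .select { |fields| fields.length == 3 && fields[1] == 'T' }
           .map { |fields| fields[2] }
           .to_set
end

# Names exported as functions by both listings, sorted alphabetically.
def colliding_functions(nm_a, nm_b)
  (exported_functions(nm_a) & exported_functions(nm_b)).sort
end

# Hypothetical sample listings (addresses and most names are invented).
SAMPLE_TIDY = <<~NM
  0000000000012340 T GetToken
  0000000000012400 T ParseHTML
  0000000000013000 T tidyParseString
NM

SAMPLE_MAGICK = <<~NM
  0000000000045000 T GetToken
  0000000000045100 T ReadImage
  0000000000046000 D SomeDataSymbol
NM

p colliding_functions(SAMPLE_TIDY, SAMPLE_MAGICK)  # prints ["GetToken"]
```

Against real libraries the two strings would come from running nm on
each shared object. Weak symbols and data symbols are ignored here, so
this is a first pass rather than a complete audit.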

I hope this information might be useful in improving libtidy. Please
let me know if there is anything else I can add.

Regards,

Matt

--
Matt Mower :: http://matt.blogs.it/




Next Message by Date:

Re: Segmentation fault using libtidy (symptoms, diagnosis, and bush medicine cure)

> It's been a long time since I did any C coding and I'm not really
> familiar with the Unix library loading rules, especially as they
> pertain to dynamically loading shared libraries so I'm not sure if
> there is a rational explanation for this behaviour.

It's normal, as the symbols are duplicated. Anyway, Ruby could use
some of the dlopen() flags (supported by newer glibcs) to try to avoid
such problems. PHP uses those glibc tricks to try to avoid these
problems (and there are many...), although miracles aren't possible,
of course :)
I would say that using RTLD_DEEPBIND in Ruby's dlopen mechanism would
fix your problem. Anyway, that's why C libraries usually prefix all
their functions with the library name.

> I hope this information might be useful in improving libtidy. Please
> let me know if there is anything else I can add.

The problem is: how to "fix" the problem without breaking
compatibility with older tidy lib versions?

Nuno
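
As an illustration of the RTLD_DEEPBIND suggestion, here is a minimal
sketch of opening a library from Ruby with that flag. It uses Fiddle
rather than the DL module of the time. Fiddle does not appear to
expose RTLD_DEEPBIND, so the constant below is an assumption: it is
the value from glibc's dlfcn.h and is only meaningful on glibc-based
systems. The helper names are hypothetical.

```ruby
require 'fiddle'

# Assumed value of RTLD_DEEPBIND, taken from glibc's <dlfcn.h>.
# Not portable: other C libraries may lack the flag or use this bit
# for something else.
RTLD_DEEPBIND = 0x00008

# dlopen() flags asking the loader to prefer the library's own symbol
# definitions over same-named symbols already in the global scope.
def deepbind_flags
  Fiddle::Handle::RTLD_LAZY | RTLD_DEEPBIND
end

# Open `lib_path` with the flags above.
# Returns a Fiddle::Handle, or nil when the library cannot be loaded.
def open_deepbound(lib_path)
  Fiddle::Handle.new(lib_path, deepbind_flags)
rescue Fiddle::DLError => e
  warn "could not load #{lib_path}: #{e.message}"
  nil
end
```

A call such as open_deepbound('libtidy.so') (the library name is an
assumption) would then give a handle whose internal calls should
prefer libtidy's own GetToken(). Whether this resolves the crash
described above has not been verified here.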
