web.html-tidy.devel

Subject: Segmentation fault using libtidy (symptoms, diagnosis, and bush medicine cure)

Hi,

I've been successfully using libtidy while developing a Ruby
application for some time, using the Tidy Ruby gem (gem == packaged
library, to all intents and purposes). I develop on Mac OS X 10.4 on a
12" PowerBook.

When it came time to deploy on Linux (first on Debian 3.1, later on
CentOS), we ran into problems with a segmentation fault in our
application. After some time I narrowed it down to our aggregator. We
use tidy in the aggregator to clean up the HTML we are getting via RSS
to make it amenable to analysis.

After weeding out a number of other contenders I finally tracked the
problem down to our Tidy-related code:

def self.sanitize( html )
  # convert to nice XHTML
  out = Tidy.open( :show_warnings => false, :show_errors => 0,
                   :output_xhtml => true, :show_body_only => true,
                   :alt_text => "Content Image",
                   :enclose_block_text => true,
                   :output_encoding => "utf8" ) do |tidy|
    tidy.clean( html )
  end

  # remove bad boy tags
  # self-closing script tags; [^>]* keeps the match inside a single tag
  out = out.gsub( /<script[^>]*\/>/nm, "" )
  # script blocks; non-greedy body so text between two blocks survives
  out = out.gsub( /<script.*?>.*?<\/script>/nm, "" )
  out = out.gsub( /<\/?font.*?\/?>/nm, "" )
  tag_links_with_class( out, 'content_link' )
end

In testing, it appeared that the segmentation fault occurred within the
call to tidy.clean(). When I extracted this as a test case (outside
our Rails application), however, it ran without causing the SIGSEGV.

Yet our use of tidy was isolated in this one spot and I confirmed that
the problem occurred on the first call to Tidy. I couldn't see how
Rails could possibly be affecting this issue.

A little spade-work with GDB turned up something very interesting:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1209771232 (LWP 26885)]
0x4e5aa833 in GetToken () from /usr/lib/libMagick.so.6

libMagick, it turns out, is part of ImageMagick. What the hell are we
doing in there? It certainly explains the SIGSEGV, since libMagick will
probably do nasty things to the tidy buffer.

It occurred to me that our Rails application uses RMagick, which is the
Ruby interface to ImageMagick, and that this loads libMagick.so.6.
Somehow loading this was affecting Tidy.

Since I had the CVS version of libtidy (to ensure that it had the
problem too) I modified the Makefile to turn on debugging and get a
proper backtrace:

#0 0x4e5aa833 in GetToken () from /usr/lib/libMagick.so.6
#1 0xb729dca1 in ParseHTML (doc=0x8855710, html=0x8536b08, mode=0) at parser.c:3635
#2 0xb729ea87 in ParseDocument (doc=0x8855710) at parser.c:4079
#3 0xb72bc868 in tidyDocParseStream (doc=0x8855710, in=0x84d9e10) at tidylib.c:1171
#4 0xb72bc084 in tidyDocParseString (doc=0x8855710,
   content=0x84d9c08 "<blockquote>\n <p>\"Benjamin Franklin was shown the new American constitution, and he said, 'I don't like it, but I will vote for it because we need something right now. But this constitution in time"...)
   at tidylib.c:881
#5 0xb72bbd66 in tidyParseString (tdoc=0x8855710,
   content=0x84d9c08 "<blockquote>\n <p>\"Benjamin Franklin was shown the new American constitution, and he said, 'I don't like it, but I will vote for it because we need something right now. But this constitution in time"...)
   at tidylib.c:803
#6 0xb72e36c9 in rb_dlsym_guardcall (type=73 'I', ret=0xbff701b8, stack=0xbff700a0, func=0xb72bbd0f) at sym.c:399

Everything looks normal up to the call to ParseHTML(). parser.c:3635 looks like:

node = GetToken( doc, ignoreWhitespace );

GetToken(): the same function name as in libMagick. I noted that lexer.c
has a GetToken() function which is almost certainly the function
ParseHTML() is expecting to call.

When I modified my previously working test case to load RMagick, it
began to fail. So the problem definitely seems to be that, once
libMagick.so is loaded, its GetToken() takes precedence over libtidy's
in the process's symbol space, so ParseHTML() ends up calling the wrong
function and libtidy breaks.

It's been a long time since I did any C coding, and I'm not really
familiar with the Unix library loading rules, especially as they
pertain to dynamically loaded shared libraries, so I'm not sure if
there is a rational explanation for this behaviour.
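
One plausible explanation is that a shared library's reference to a
global function name is bound to the first definition found in the
process's lookup order, which need not be the library's own. The toy
Ruby model below is only a sketch of that idea, not of any real
dynamic linker: the library contents and their order are assumptions
chosen to match the backtrace, and resolve() is a made-up helper.

```ruby
# Toy model only: real dynamic linkers are far more involved.
# Each entry is [library name, symbols it exports], in lookup order.
# The symbol lists are illustrative, not real symbol dumps.
LOOKUP_ORDER = [
  ["libMagick.so.6", ["GetToken", "ReadImage"]],
  ["libtidy.so",     ["tidyParseString", "ParseHTML", "GetToken"]]
]

# Return the name of the first library in lookup order that defines
# `symbol`, or nil if no library defines it.
def resolve(symbol, lookup_order = LOOKUP_ORDER)
  entry = lookup_order.find { |_name, symbols| symbols.include?(symbol) }
  entry && entry.first
end

puts resolve("GetToken")   # libMagick.so.6
puts resolve("ParseHTML")  # libtidy.so
```

In this model, reversing the order makes GetToken resolve to libtidy
again, which would fit the observation that the test case only fails
once RMagick is loaded.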

My current workaround is to rename GetToken() to GetTidyToken() in
lexer.h, lexer.c, and parser.c and compile my own version of the
library. In testing so far I've not experienced any further problems.
However, I am concerned that the same problem could easily be lurking
in other function names.
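
One way to look for other clashing names might be to compare the
symbols each library exports. The Ruby sketch below assumes the
listings come from something like GNU `nm -D --defined-only` run on
each .so file. The sample listings are made up for illustration, and
defined_symbols() and collisions() are hypothetical helper names.

```ruby
# Extract globally defined symbol names from `nm -D --defined-only`
# style text. Lines look like "0001a2b0 T GetToken"; only the uppercase
# types T, D, B and R are kept, and lines without an address are skipped.
def defined_symbols(nm_output)
  nm_output.each_line.map { |line|
    m = line.match(/\A\s*[0-9a-fA-F]+\s+([TDBR])\s+(\S+)\s*\z/)
    m && m[2]
  }.compact
end

# Names defined by both libraries, sorted alphabetically.
def collisions(nm_a, nm_b)
  (defined_symbols(nm_a) & defined_symbols(nm_b)).sort
end

# Made-up sample listings, not real output from either library.
TIDY_NM = <<~EOS
  0001a2b0 T GetToken
  0001b000 T ParseHTML
  00020000 T tidyParseString
EOS
MAGICK_NM = <<~EOS
  0004e833 T GetToken
  00051000 T ReadImage
           U malloc
EOS

puts collisions(TIDY_NM, MAGICK_NM).inspect  # ["GetToken"]
```

Any name reported this way would be a candidate for the same kind of
rename as GetToken().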

I hope this information might be useful in improving libtidy. Please
let me know if there is anything else I can add.

Regards,

Matt

--
Matt Mower :: http://matt.blogs.it/
