Segmentation fault using libtidy (symptoms, diagnosis, and bush medicine cure): msg#00018 web.html-tidy.devel
Hi,

I've been successfully using libtidy for some time while developing a Ruby application, via the Tidy Ruby gem (a gem is, to all intents and purposes, a packaged library). I develop on Mac OS X 10.4 on a 12" PowerBook.

When it came time to deploy on Linux (first on Debian 3.1, later on CentOS) we ran into a segmentation fault in our application. After some time I narrowed it down to our aggregator. We use tidy in the aggregator to clean up the HTML we get via RSS, to make it amenable to analysis. After weeding out a number of other contenders I finally tracked the problem down to our Tidy-related code:

    def self.sanitize( html )
      # convert to nice XHTML
      out = Tidy.open(:show_warnings=>false, :show_errors=>0, :output_xhtml=>true,
                      :show_body_only=>true, :alt_text=>"Content Image",
                      :enclose_block_text=>true, :output_encoding=>"utf8") do |tidy|
        tidy.clean( html )
      end

      # remove bad boy tags
      out = out.gsub( /<script.+?\/>/nm, "" )
      out = out.gsub( /<script.*?>.*<\/script>/nm, "" )
      out = out.gsub( /<\/?font.*?\/?>/nm, "" )

      tag_links_with_class( out, 'content_link' )
    end

In testing, the segmentation fault appeared to occur within the call to tidy.clean(). However, when I extracted this as a test case (outside our Rails application) it ran without causing the SIGSEGV. Yet our use of tidy was isolated to this one spot, and I confirmed that the problem occurred on the first call to Tidy. I couldn't see how Rails could possibly be affecting this.

A little spade-work with GDB turned up something very interesting:

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread -1209771232 (LWP 26885)]
    0x4e5aa833 in GetToken () from /usr/lib/libMagick.so.6

libMagick, it turns out, is part of ImageMagick. What the hell are we doing in there? It certainly explains the SIGSEGV, since libMagick will probably do nasty things to the tidy buffer.

It occurred to me that our Rails application uses RMagick, the Ruby interface to ImageMagick, and that this loads libMagick.so.6. Somehow loading this was affecting Tidy. Since I had the CVS version of libtidy (to check that it had the problem too), I modified the Makefile to turn on debugging and get a proper backtrace:

    #0 0x4e5aa833 in GetToken () from /usr/lib/libMagick.so.6
    #1 0xb729dca1 in ParseHTML (doc=0x8855710, html=0x8536b08, mode=0) at parser.c:3635
    #2 0xb729ea87 in ParseDocument (doc=0x8855710) at parser.c:4079
    #3 0xb72bc868 in tidyDocParseStream (doc=0x8855710, in=0x84d9e10) at tidylib.c:1171
    #4 0xb72bc084 in tidyDocParseString (doc=0x8855710, content=0x84d9c08 "<blockquote>\n <p>\"Benjamin Franklin was shown the new American constitution, and he said, 'I don't like it, but I will vote for it because we need something right now. But this constitution in time"...) at tidylib.c:881
    #5 0xb72bbd66 in tidyParseString (tdoc=0x8855710, content=0x84d9c08 "<blockquote>\n <p>\"Benjamin Franklin was shown the new American constitution, and he said, 'I don't like it, but I will vote for it because we need something right now. But this constitution in time"...) at tidylib.c:803
    #6 0xb72e36c9 in rb_dlsym_guardcall (type=73 'I', ret=0xbff701b8, stack=0xbff700a0, func=0xb72bbd0f) at sym.c:399

Everything looks normal up to the call to ParseHTML(). parser.c:3635 looks like:

    node = GetToken( doc, ignoreWhitespace );

GetToken(): the same function name as in libMagick. I noted that lexer.c has a GetToken() function, which is almost certainly the one ParseHTML() is expecting to call. When I modified my previously working test case to load RMagick, it began to fail too.

So the problem definitely seems to be that loading the libMagick.so library overrides GetToken() in the symbol space and, in the process, breaks the libtidy library.
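
If, as the backtrace suggests, the dynamic linker binds a name to whichever loaded library defines it first, then any function name exported by both libraries is a candidate for the same failure. A rough way to look for such names is to compare the libraries' dynamic symbol listings. The following is only a sketch in Ruby: it assumes the listings come from GNU `nm -D --defined-only`, the helper names `defined_functions` and `colliding_functions` are made up for illustration, and the sample listings (addresses and all) are hypothetical rather than taken from real builds of either library.

```ruby
require "set"

# Return the names that an `nm -D --defined-only` listing marks as defined in
# the text (code) section. Useful lines look like "<address> <type> <name>".
def defined_functions(nm_output)
  names = Set.new
  nm_output.each_line do |line|
    _address, type, name = line.split
    next if name.nil?                # blank line, or an entry with no address
    next unless type == "T"          # "T" marks a global symbol in the text section
    names << name.sub(/@.*\z/, "")   # drop a trailing "@VERSION" or "@@VERSION"
  end
  names
end

# Names defined as functions by both listings, sorted for stable output.
def colliding_functions(nm_output_a, nm_output_b)
  (defined_functions(nm_output_a) & defined_functions(nm_output_b)).to_a.sort
end

# Hypothetical listings, for illustration only.
tidy_listing = <<~NM
  0001a2b0 T GetToken
  0001c410 T ParseHTML
           U malloc
NM

magick_listing = <<~NM
  000f3833 T GetToken
  000f4000 T AcquireImage
NM

puts colliding_functions(tidy_listing, magick_listing).inspect
```

Against real libraries the two arguments would be the output of running nm on each shared object, for example via backticks. A non-empty result only flags names worth inspecting; whether a given clash actually bites depends on load order and on how each library was linked.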

It's been a long time since I did any C coding, and I'm not really familiar with the Unix library loading rules, especially as they pertain to dynamically loaded shared libraries, so I'm not sure whether there is a rational explanation for this behaviour.

My current workaround is to rename GetToken() to GetTidyToken() in lexer.h, lexer.c, and parser.c and compile my own version of the library. In testing so far I've not experienced any further problems. However, I am concerned that the same problem could easily be lurking in other function names.

I hope this information might be useful in improving libtidy. Please let me know if there is anything else I can add.

Regards,

Matt

--
Matt Mower :: http://matt.blogs.it/