Re: Tidy on WORD-2k htm files: msg#00254 web.html-tidy.user
Sailesh Panchang wrote:

> Hello List,

The problem is that the files generated by M$Word are not really HTML; they are a Micro$oft-proprietary XML, which just happens to be a superset of XHTML. These files look OK in a browser only because most browsers have been specifically designed to ignore unknown elements and attributes, rather than failing when they are encountered.

Tidy has a mode specifically designed to clean M$Word XML files. From the command prompt type:

    tidy --word-2000 yes [input.htm] > [output.html]

Tidy is an extraordinarily flexible program, which means that there is a plethora of command line options. You should carefully review the list of options at http://tidy.sourceforge.net/docs/quickref.html before concluding that Tidy will not do what you want.

> The documentation states that in case Tidy encounters errors, the conversion is unpredictable. So does it mean it is not going to work?

There are a number of common HTML coding errors that are simply too ambiguous to be fixed automagically; these errors must be fixed by a human who presumably knows what was intended. When Tidy encounters one of these errors it prints an error message identifying the line number where the error occurred, so a human can look at the problem, but it normally does not produce _any_ output in these cases. Tidy can be forced to produce output even when it cannot fix the errors by specifying the "--force-output yes" option, but the output will probably not be correct HTML.

> Is using the WORD 2.0 filter a more reliable option?

No. While part of the flaw in Word 2000 output is the non-standard elements and attributes, Word 2000 is also known for producing "bloated" XML. This is due at least in part to the fact that M$Word insists on adding font and class specifications to _every_ paragraph in a file, even when they are all identical. One of the functions provided by the "--word-2000 yes" option is to strip this potentially irrelevant material from the file. The Word 2.0 filter leaves this stuff in the file (although much of this badness can be removed by running Tidy with the "--drop-font-tags yes" option after using the Word 2.0 filter).

> Thanks,
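For anyone who wants to script the steps described above, here is a rough sketch rather than a definitive recipe. It assumes Tidy's usual exit status convention (0 for success, 1 for warnings, 2 for errors). The function names `tidy_status_text` and `clean_word_html`, and the `TIDY` and `FORCE_OUTPUT` variables, are made up for illustration; only the `--word-2000` and `--force-output` options come from the message.

```shell
#!/bin/sh
# Sketch only: helper names and the TIDY / FORCE_OUTPUT variables are
# hypothetical; the tidy options are the ones discussed in the message.

# Translate Tidy's exit status into a short word.
# Assumed convention: 0 = success, 1 = warnings, 2 = errors.
tidy_status_text() {
    case "$1" in
        0) echo "ok" ;;
        1) echo "warnings" ;;
        2) echo "errors" ;;
        *) echo "unknown" ;;
    esac
}

# Usage: clean_word_html INPUT OUTPUT
# Runs Tidy in Word 2000 mode and reports how it went on stderr.
clean_word_html() {
    cw_in=$1
    cw_out=$2
    cw_status=0
    "${TIDY:-tidy}" --word-2000 yes "$cw_in" > "$cw_out" || cw_status=$?
    echo "tidy: $(tidy_status_text "$cw_status")" >&2

    # With errors Tidy normally writes no output at all. If asked to,
    # rerun with --force-output; the result may not be correct HTML.
    if [ "$cw_status" -ge 2 ] && [ "${FORCE_OUTPUT:-no}" = "yes" ]; then
        "${TIDY:-tidy}" --word-2000 yes --force-output yes "$cw_in" > "$cw_out" || true
    fi
    return "$cw_status"
}
```

After sourcing the file, something like `clean_word_html report.htm report.html` would run the conversion, and `FORCE_OUTPUT=yes clean_word_html report.htm report.html` would keep whatever Tidy can produce even when it reports errors. In that second case the line numbers in Tidy's error messages are still the place to start fixing the file by hand.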