Re: Autogenerating Topic Maps: msg#00142 text.xml.xtm.general
Premnath Raghavendran wrote:
>
> How do I autogenerate Topic Maps? My data source is a
> collection of word documents on various subjects.

That depends critically on which information in your corpus you want to use when autogenerating the topic map. Let's assume you want one that just holds word counts (a trivial case). Conceptually, you might write this code:

1) create an empty Topic Map, TM
2) for each document found in the corpus:
   A) make a new corresponding "document" topic, D, in TM
   B) add to D some selected, document-embedded metadata
   C) for each word found within the given document:
      i) look in TM for a "word" topic, W, named by the word
         a) if found, increment its contained word count
         b) if not, build a new W topic with count=1
      ii) in the TM, associate D with W

When this stops, TM will hold a topic for each word and document. Thanks to step {C.ii}, its associations will also say what words each document holds.

> If I want to develop a logic to automatically generate topic
> maps for these various documents, how do I do?

Your phrasing suggests goals more complex than word counts, but I can't tell what they are. Please write back and get more specific. If your answer can be made to fit into a variation of the pseudo code, you'll have the germ of a spec. These links may help to clarify the possible options:

[1] http://www.ontopia.net/topicmaps/explorations.html
[2] http://www.lexikos.com/nlptools.jsp

> An article at Ontopia suggests to first create RDF out
> of them & then move ahead to Topic Maps.

The first article on [1] under "autogeneration" does indeed suggest using RDF when the data source is structured. If your corpus is full of data tables, that might apply. But I would expect strings and their locations within the corpus to be more useful.

Steve Pepper's article (just below that one) cites embedded metadata as a source for step 2B. If you can get at it, I'd add the semi-structured markup in Word documents (exposed if you save one to HTML, WordPerfect, etc.). It may let you find and add (e.g.) "headings", to locate words more precisely within your TM.

Steve also mentions unstructured text as a source. If your pseudo code spec replaces "word" with "phrase", "name", or even "root word", it will cross a line into NLP, a realm loaded with complex issues. Here, the R&D costs get larger, and accuracy may vary widely with the details of your specs, code, and corpus. But there is no free lunch, and for a large corpus, NLP approaches often make economic sense. Operator guidance boosts accuracy, so if you can accept the "assisted generation" of TMs instead of their "autogeneration", use it.

[2] shows the kinds of modules needed (in the green and blue areas). What they jointly build are symbolic models of what each given "phrase" refers to - its subject. Such models could be given in RDF, or in some other formal language specific to the NLP processor. Regardless of those details, such phrase-referent models are basically just what you need to find or build a topic that will represent the subject inside your new TM. That's why NLP is such a useful tool here!

Typically, to use NLP, you must separately model beforehand (e.g., using [2]'s yellow modules) all the *types* of subjects whose references you will seek in your corpus. The more effort you put into this task, the less "autogenerated" your TMs will seem. Your specs should thus address how much prep work you will accept.

> Please help me out from implementation point of view.
> How exactly do I go about it?

More specs are needed to decide on implementation. Sorry, but the range of options here is broad, as you can see.

Cheers,
Dan Corwin
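
The word-count pseudo code in the reply above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the author's implementation: plain dictionaries stand in for a real topic map engine, the corpus is assumed to be already extracted to plain text, and the name `build_word_count_topic_map`, the `char_count` metadata field, and the tokenizing regex are all hypothetical choices made for the example.

```python
import re


def build_word_count_topic_map(corpus):
    """Build a toy topic map from {document name: plain text}.

    Assumption: dicts and a set stand in for real topic map objects.
    """
    # Step 1: create an empty topic map, TM.
    tm = {"documents": {}, "words": {}, "associations": set()}

    # Step 2: visit each document in the corpus.
    for doc_name, text in corpus.items():
        # 2A/2B: make a "document" topic D and attach some metadata.
        # The character count is a placeholder for real embedded metadata.
        tm["documents"][doc_name] = {"char_count": len(text)}

        # 2C: visit each word (hypothetical tokenizer: lowercase runs of
        # letters, digits, and apostrophes).
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            # 2C.i: look in TM for a "word" topic W named by the word.
            topic = tm["words"].get(word)
            if topic is not None:
                topic["count"] += 1               # a) found: increment count
            else:
                tm["words"][word] = {"count": 1}  # b) not found: new topic

            # 2C.ii: associate D with W (the set ignores repeats).
            tm["associations"].add((doc_name, word))

    return tm


if __name__ == "__main__":
    sample = {
        "a.txt": "Topic maps map topics",
        "b.txt": "Maps of topics",
    }
    result = build_word_count_topic_map(sample)
    for word, topic in sorted(result["words"].items()):
        print(word, topic["count"])
```

Swapping the tokenizer for one that yields phrases, names, or root words is where such a sketch would cross into the NLP territory the reply describes.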