Re: question: msg#00271

nutch-user.lucene.apache.org

Subject: Re: question

I believe it can.
Check your configuration files, nutch-site.xml and nutch-default.xml.

You will find something like this:

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|swf|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems
with the underlying commons-httpclient library.
</description>
</property>

Add "msword" to the list of parsers in that value. Change
parse-(text|html|swf|pdf)
to
parse-(text|html|swf|pdf|msword)

There is a plugin in the plugins folder, parse-msword,
which parses MS Word documents.

I have not tried it so far.
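
As a rough sketch of the change described above (untested, like the suggestion itself): settings in nutch-site.xml override those in nutch-default.xml, so the modified property can be placed inside the existing <configuration> element of nutch-site.xml instead of editing the defaults. The value below is the one quoted above with "msword" added. It assumes that the parse-msword plugin is actually present in the plugins folder of your Nutch version.

```xml
<!-- nutch-site.xml, inside the existing <configuration> element.
     Assumption: the parse-msword plugin exists in your plugins folder. -->
<property>
  <name>plugin.includes</name>
  <!-- Same value as quoted from the defaults, with "msword" added to the parse-(...) group. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|swf|pdf|msword)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

If your own plugin.includes value differs from the one quoted here, keep your value and only add "msword" to its parse-(...) group.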

Jair Piedrahita Vargas wrote:
> Can Nutch search inside the content of an msword file? I've tried, but it
> says "parser not found for contentType=application/msword"
> What can I do to correct this error?
>
> Thanks
>
> JAIR PIEDRAHITA VARGAS
