
Re: Processing tokens in batches - downside: msg#00020

Subject: Re: Processing tokens in batches - downside
On 8/23/05, Marvin Humphrey <marvin-Z34TZEgQMOFloyL29VTzIw@xxxxxxxxxxxxxxxx> 
wrote:
> StopFilter doesn't work with TokenBatch because it grabs Tokens one
> at a time (via a method call to next()), then grabs the text from the
> Token (another method call, this time to text()), then checks to see
> if the text is present in the stoplist hash.

It is entirely possible I am missing a trick here, but looking daft
ain't stopped me before...

* Add next_batch to CharTokenizer.pm as a private method (_next_batch)
* _next_batch does as you suggest, but stores the batch internally in a
cache rather than returning it
* CharTokenizer->next() looks in the cache first; if the cache is empty,
it calls _next_batch

If the cache size is N, then every Nth call may be a little slower,
but you gain the speedup of processing more than one token at a time
in CharTokenizer.
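
A minimal sketch of the idea, in case it helps. Everything here is
hypothetical: Sketch::CharTokenizer is a stand-in, not the real
CharTokenizer.pm; tokens are plain strings rather than Token objects;
and the batch_size, cache and batch_runs fields are names made up for
illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for CharTokenizer.pm, not the real class.
package Sketch::CharTokenizer;

sub new {
    my ( $class, %args ) = @_;
    return bless {
        text       => $args{text},
        batch_size => $args{batch_size} || 4,    # the "N" above (assumed name)
        cache      => [],                        # tokens waiting to be handed out
        batch_runs => 0,                         # counts refills, for illustration only
    }, $class;
}

# Private: tokenize up to batch_size tokens in one go and park them
# in the cache instead of returning them.
sub _next_batch {
    my $self = shift;
    $self->{batch_runs}++;
    # /gc keeps the match position after a failed match, so a call at
    # end of text finds nothing instead of starting over.
    while ( @{ $self->{cache} } < $self->{batch_size}
        and $self->{text} =~ /(\w+)/gc )
    {
        push @{ $self->{cache} }, $1;
    }
    return;
}

# Public: same one-token-at-a-time interface that StopFilter expects.
sub next {
    my $self = shift;
    $self->_next_batch unless @{ $self->{cache} };
    return shift @{ $self->{cache} };    # undef once the text is exhausted
}

package main;

my $tokenizer = Sketch::CharTokenizer->new(
    text       => "the quick brown fox jumps",
    batch_size => 2,
);

my @tokens;
while ( defined( my $token = $tokenizer->next ) ) {
    push @tokens, $token;
}
print join( ' ', @tokens ), "\n";
```

The caller still makes one next() call per token, so a filter written
against the existing interface would not need to change; only the
tokenizing itself happens in batches.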

Of the several optimizations listed in the "upside" message, I think this
approach would keep benefits 2-5, but you would lose (1), fewer method
calls, since StopFilter would still call next() and text() once per token.

