On 8/23/05, Marvin Humphrey <marvin-Z34TZEgQMOFloyL29VTzIw@xxxxxxxxxxxxxxxx>
wrote:
> StopFilter doesn't work with TokenBatch because it grabs Tokens one
> at a time (via a method call to next()), then grabs the text from the
> Token (another method call, this time to text()), then checks to see
> if the text is present in the stoplist hash.
It is entirely possible I am missing a trick here, but looking daft
ain't stopped me before...
* Add next_batch to CharTokenizer.pm as a private method (_next_batch)
* _next_batch does as you suggest, but stores the batch internally in a
cache rather than returning it
* CharTokenizer->next() looks in the cache first; if the cache is empty,
it calls _next_batch
If the cache size is N, then every Nth call may be a little slower,
but you gain the speedup of processing more than one token at a time
in CharTokenizer.
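A rough sketch of the idea, in case it helps. The class name
CachedTokenizer, the batch_size parameter, the batch_calls counter and
the \w+ tokenizing regex are all made up for illustration; this is not
the actual CharTokenizer code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for CharTokenizer (all names here are assumptions).
package CachedTokenizer;

sub new {
    my ( $class, %args ) = @_;
    return bless {
        text        => $args{text},
        batch_size  => $args{batch_size} || 4,
        cache       => [],    # tokens produced but not yet handed out
        pos         => 0,     # where the next batch starts scanning
        batch_calls => 0,     # bookkeeping only, to show how often we refill
    }, $class;
}

# Private: tokenize up to batch_size tokens in one pass and keep them
# in the cache instead of returning them.
sub _next_batch {
    my $self = shift;
    $self->{batch_calls}++;
    pos( $self->{text} ) = $self->{pos};
    while ( @{ $self->{cache} } < $self->{batch_size}
        and $self->{text} =~ /(\w+)/g )
    {
        push @{ $self->{cache} }, $1;
    }
    # A failed //g match resets pos() to undef; treat that as end of input.
    my $p = pos( $self->{text} );
    $self->{pos} = defined $p ? $p : length $self->{text};
    return;
}

# Public: same one-token-at-a-time interface as before, so callers such
# as StopFilter would not need to change.
sub next {
    my $self = shift;
    $self->_next_batch unless @{ $self->{cache} };
    return shift @{ $self->{cache} };    # undef once the input is exhausted
}

package main;

my $tokenizer = CachedTokenizer->new(
    text       => 'the quick brown fox jumps',
    batch_size => 2,
);
my @tokens;
while ( defined( my $token = $tokenizer->next ) ) {
    push @tokens, $token;
}
print join( ' ', @tokens ), "\n";
```

The caller still makes one next() call per token, so the per-token
method call overhead stays; only the tokenizing work itself is batched.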
Of the several optimizations mentioned on the "upside", I think this
approach would keep benefits 2-5, but you would lose (1): fewer
method calls.