Subject: Re: Proposal of some radical changes to API - msg#00053

List: lang.ruby.ferret.general

On 6/8/06, Marvin Humphrey <marvin-Z34TZEgQMOFloyL29VTzIw@xxxxxxxxxxxxxxxx>
wrote:
>
> On Jun 6, 2006, at 4:33 PM, Neville Burnell wrote:
>
> >>> I asked the question because I honestly wanted to see a concrete
> >>> example of an application that couldn't be handled within the
> >>> constraint of pre-defined fields.
> >
> > My current application involves writing a web application which can
> > search a Ferret index built from a SQL database.
> >
> > The idea is that the customer supplies SQLs for, say, customers,
> > suppliers, sales and purchases etc. The app then retrieves the rows
> > from the datasource and indexes them using Ferret. The app provides
> > both an HTML website as an interface to the index, and also an XML
> > API which can be used by non-browser clients.
> >
> > The field set is quite different for each SQL [and is essentially
> > out of our control].
>
> So at what point does your app learn the structure of the SQL table?
> Would it work if you were to start each session by telling the index
> writer about the fields that were coming?
>
> def connect(field_names)
>   field_names.each do |field_name|
>     index.spec_field(field_name) # use default properties
>   end
> end
>
> def add_to_index(submission)
>   index.add_hash_as_doc(submission)
> end
>
> I can imagine a scenario where that's not possible, and the fields
> may change up on each insert. In that case, under the interface I
> envision, you'd have to do something like...
>
> def add_to_index(submission)
>   submission.each do |field_name, value|
>     index.spec_field(field_name) # use default properties
>   end
>   index.add_hash_as_doc(submission)
> end
>
> FWIW, this stuff is happening anyway, behind the scenes.
> Essentially, every time you add a field to an index, Ferret asks,
> "Say, is this field indexed? And how about TermVectors, you want
> those?" The 10_000th time you add the field, Ferret asks, "This
> field wasn't indexed before -- have you changed your mind? OK, I'll
> check back again later."... 1_000_000th doc: "You sure? How about I
> make it indexed? Awwwww, c'mon... Hey, could you use some TermVectors?"
>
> When it makes sense, of course you want to simplify the interface and
> hide the complexity inside the library. However, given that it's not
> possible to make coherent updates to existing data within a
> Lucene-esque file format, my argument is that field definitions should never
> change. So the repeated calls to spec_field above would be
> completely redundant -- you'd get an error if you ever tried to
> change the field def.
>
> Your app would be a little less elegant, it's true (performance
> impact would be somewhere between insignificant and tiny unless you
> had a zillion very short fields). However, I think the use case
> where the fields are not known in advance is the exception rather
> than the rule.
>
> It would also be possible to use Dave's polymorphic hash-as-doc
> technique, where if the hash value is a Field object, you spec out
> the field definition using that Field object's properties -- you
> would just use full-on Field objects for each field. My argument
> would be, again, that the field definitions should not change. If
> you don't agree with that and the definition has to be modifiable
> (within the current constraints), then that single-method technique
> is probably better. However, if the definition is not modifiable,
> then I'd argue it's cleaner to separate the two functions.
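
The rule argued for above (a field definition is fixed once set, so
repeating spec_field is harmless and changing it is an error) could be
sketched roughly as follows. This is only an illustration: SketchIndex,
FieldDefinitionError and the default property names are assumptions
made up for the example, not part of Ferret or of the proposed API;
only the method names spec_field and add_hash_as_doc come from the
message.

```ruby
# Hypothetical stand-in for the proposed interface; not Ferret's real API.
class FieldDefinitionError < StandardError; end

class SketchIndex
  # Assumed default properties, chosen only for illustration.
  DEFAULT_PROPS = { indexed: true, stored: true, term_vectors: false }.freeze

  attr_reader :fields, :docs

  def initialize
    @fields = {} # field name => frozen property hash
    @docs = []
  end

  # Define a field once. Repeating an identical definition is a no-op;
  # a conflicting definition raises instead of silently changing it.
  def spec_field(name, props = {})
    definition = DEFAULT_PROPS.merge(props).freeze
    existing = @fields[name]
    if existing.nil?
      @fields[name] = definition
    elsif existing != definition
      raise FieldDefinitionError,
            "field #{name.inspect} is already defined as #{existing}"
    end
    self
  end

  # Add a document; every field it uses must have been declared first.
  def add_hash_as_doc(doc)
    unknown = doc.keys - @fields.keys
    unless unknown.empty?
      raise FieldDefinitionError, "unspecified fields: #{unknown.inspect}"
    end
    @docs << doc
    self
  end
end

index = SketchIndex.new
index.spec_field(:name)
index.spec_field(:name) # redundant but harmless
index.add_hash_as_doc({ name: "Acme" })
```

Keeping the declaration and the insert as two methods means the cost of
the "is this field indexed?" question is paid in spec_field, and
add_hash_as_doc only has to check that the field is known.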

I completely agree with you that field definitions should not change
once they are set. However, I don't think having the library add
missing fields with a default set of values (which would be set when
you create the index) adds too much complexity. You simply need to
check whether the field already exists. You already have to look up
the field number anyway. So, to add dynamic fields, simply check to
make sure a valid field number was found and add the field if it
wasn't.
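
A minimal sketch of that lookup-or-add step, assuming a simple
in-memory table. The FieldInfos class, its method names and the default
properties are hypothetical and chosen only for illustration; they are
not Ferret's or Lucy's actual internals.

```ruby
# Hypothetical sketch; names are illustrative, not a real library API.
class FieldInfos
  def initialize(defaults = { indexed: true, stored: true, term_vectors: false })
    @defaults = defaults.freeze # set once, when the index is created
    @numbers = {}               # field name => field number
    @props = []                 # field number => properties
  end

  # Look up the field number, which has to happen on every add anyway.
  # If the field is unknown, register it with the index-wide defaults.
  def field_number(name)
    @numbers.fetch(name) do
      @props << @defaults
      @numbers[name] = @props.size - 1
    end
  end

  # Properties of a known field, or nil if it was never added.
  def properties(name)
    number = @numbers[name]
    number && @props[number]
  end

  def size
    @numbers.size
  end
end

infos = FieldInfos.new
row = { "customer" => "Acme", "city" => "Perth" }
numbers = row.keys.map { |name| infos.field_number(name) }
# numbers => [0, 1]
```

Because existing entries are never overwritten, a field's definition
stays fixed after it is first seen, while new fields can still appear
after documents have been added.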

Of course, this is just as easy to implement in the binding code, so I
don't mind whether it gets into Lucy core or not. As long as you can
add new fields to an index after documents have been added, I'm happy,
and it seems from your example (nice Ruby code, by the way) that that
is your plan.

Dave



Next Message by Date:

Re: Proposal of some radical changes to API

>> So at what point does your app learn the structure of the SQL table?

At the moment I know the structure after executing the SQL and fetching
the first row [a Ruby hash]. But the field set will change from SQL to
SQL, and Ferret is doing all the field specification for me via
hash-as-doc, a la:

def create
  @index = Ferret::Index::Index.new()
  conn = ODBC.connect(@odbc[:dsn], @odbc[:uid], @odbc[:pwd])
  @sqls.each do |sql|
    stmt = conn.prepare(sql)
    stmt.execute.each_hash{ |row| @index << row }
    stmt.close
    stmt.drop
  end
  conn.disconnect
end

The field definitions do not change though, so I'm happy as long as the
hash-as-doc support remains in Ferret.

Cheers,

Neville

-----Original Message-----
From: ferret-talk-bounces-GrnCvJ7WPxnNLxjTenLetw@xxxxxxxxxxxxxxxx
[mailto:ferret-talk-bounces-GrnCvJ7WPxnNLxjTenLetw@xxxxxxxxxxxxxxxx]
On Behalf Of Marvin Humphrey
Sent: Thursday, 8 June 2006 3:07 PM
To: ferret-talk-GrnCvJ7WPxnNLxjTenLetw@xxxxxxxxxxxxxxxx
Subject: Re: [Ferret-talk] Proposal of some radical changes to API

[snip: Marvin's message, quoted in full above]
