osdir.com
mailing list archive

Subject: Re: The Blog - msg#00129

List: couchdb-user

> > CouchDB won't allow you to "jump to page X", but if you look at
> > e.g. Google, it doesn't work either. [...]
> > But surrogate keys are considered harmful and I'd say (but that
> > really depends on the application), not very helpful.
>
> I guess I was assuming that CouchDB, due to its different nature, has
> a sophisticated solution for this. But apparently pagination is a
> problem that is really hard to solve.

It seems to me that CouchDB is at least no worse than an RDBMS here.

Any RDBMS I know builds its indexes using B-trees. So if you do

    SELECT ... FROM db ORDER BY k LIMIT 10 OFFSET 5000

then you're forcing the SQL database to traverse its B-tree index for 5000
entries, then retrieve the next 10, then use those to find the rows you're
interested in.

If my understanding is right, then exactly the same is true of couchdb if
you use the skip/limit options on a view.

Both can use relative paging (e.g. SELECT ... WHERE k >= startkey) if you're
only interested in "next page", "previous page". That's what I'd use for
very large datasets. You can easily do links for the next 10 pages (say), by
selecting more than you need to display in the first page.
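
A minimal sketch of that relative paging, simulated in plain JavaScript over an in-memory sorted array rather than a real view. `queryView` and `fetchPage` are hypothetical helpers; only `startkey` and `limit` correspond to actual view options. The idea is to ask for one row more than the page size and use the extra row's key as the start of the next page. It assumes keys are unique; with duplicate keys a real view query would also need `startkey_docid`.

```javascript
// Hypothetical stand-in for a view query: rows are sorted by key, and only
// the startkey and limit options are supported.
function queryView(rows, opts) {
  const start = opts.startkey === undefined
    ? 0
    : rows.findIndex(r => r.key >= opts.startkey);
  if (start === -1) return [];
  return rows.slice(start, start + opts.limit);
}

// Fetch one page plus one extra row; the extra row's key is where the
// next page starts, so no skip/offset is ever needed.
function fetchPage(rows, startkey, pageSize) {
  const got = queryView(rows, { startkey: startkey, limit: pageSize + 1 });
  const hasNext = got.length > pageSize;
  return {
    rows: got.slice(0, pageSize),
    nextStartkey: hasNext ? got[pageSize].key : null
  };
}

// Example data: 25 rows with zero-padded string keys so they sort correctly.
const exampleRows = [];
for (let i = 0; i < 25; i++) {
  exampleRows.push({ key: "k" + String(i).padStart(3, "0"), value: i });
}

const page1 = fetchPage(exampleRows, undefined, 10);
const page2 = fetchPage(exampleRows, page1.nextStartkey, 10);
const page3 = fetchPage(exampleRows, page2.nextStartkey, 10);
```

Each page costs one index lookup plus a short scan, however deep into the dataset you are, which is the point of preferring this to skip/limit.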

However, couchdb offers you a number of options which a SQL database
doesn't. For instance:

1. When you generate your view, you can emit the entire document (or any
selection of fields of special interest) as the value. This means that your
index query which returns 10 keys can return the 10 documents as well; a SQL
database may have to do 10 additional head seeks to return the rows.

There is a tradeoff in index disk space used, of course, but the choice is
up to you.
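
As a sketch of what that looks like, here is a map function that emits a selection of fields as the value. The document fields (`type`, `posted`, `title`, `summary`) are hypothetical, and the small `emit` harness exists only so the example runs outside CouchDB, which normally supplies `emit` itself.

```javascript
// Minimal harness: collect emitted rows in an array so the sketch runs
// standalone. Inside CouchDB, emit() is provided by the view server.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map function in CouchDB's style. Only the fields needed for a listing
// page go into the index; the long body stays in the document.
function mapPosts(doc) {
  if (doc.type === "post") {
    emit(doc.posted, { title: doc.title, summary: doc.summary });
  }
}

[
  { _id: "entry1", type: "post", posted: "2009/02/01",
    title: "First", summary: "Hello", body: "long text..." },
  { _id: "comment1", type: "comment", text: "Nice post" }
].forEach(mapPosts);
```

Emitting `doc` itself instead of the selected fields would put the whole document in the index, which is the disk space tradeoff mentioned above.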

2. Updating. If you do 500 INSERTs followed by one SELECT in a SQL database,
unless you use some admin-level tricks like temporarily disabling indexing,
all affected indexes will be updated for every INSERT.

With couchdb, the view index is brought up to date in a single pass when the
view is next queried. This may add some latency to that query, but it's far
less work than updating the indexes 500 times.

3. The reduce data structure is extremely smart. If there are N documents
stored in one B-tree node, then the pre-computed reduce value for those N
documents is stored in that node too.

So if you ask for an aggregate value from K1..K4, and this spans some whole
blocks of B-tree nodes, only the end ones need to be re-reduced:

________K1___ _______________ ________________ _____K4_________
        <---> <-------------> <--------------> <----->
        reduce    already         already      reduce
         R1'    reduced R2       reduced R3     R4'

and couchdb just calculates reduce(R1',R2,R3,R4') to get the final answer.

In principle, an RDBMS could use the same kind of logic for

    SELECT count(*) FROM db WHERE k BETWEEN 'k1' AND 'k4'

but I don't know if any of them do. I highly doubt that they do it for
arbitrary aggregation functions like

    SELECT sum(n) FROM db WHERE k BETWEEN 'k1' AND 'k4'

Couchdb makes this trivial and highly efficient, because you explicitly ask
for which summary values you want to be handled in this way.
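
To make the idea concrete, here is a much simplified model of it in plain JavaScript. This is not CouchDB's actual implementation: it uses a flat list of leaf nodes instead of a real B-tree, and `makeNode` and `rangeSum` are hypothetical names. It uses a sum, where combining stored values is the same operation as the original reduction.

```javascript
function sum(values) {
  return values.reduce((a, b) => a + b, 0);
}

// Each node holds sorted key/value rows plus a precomputed reduction of
// its own rows, computed once when the node is built.
function makeNode(rows) {
  return { rows: rows, reduced: sum(rows.map(r => r.value)) };
}

// Sum the values for keys in [lo, hi]. Nodes entirely inside the range
// contribute their stored reduction; only partly covered nodes are scanned.
function rangeSum(nodes, lo, hi) {
  const partials = [];
  let rowsScanned = 0;
  for (const node of nodes) {
    const first = node.rows[0].key;
    const last = node.rows[node.rows.length - 1].key;
    if (last < lo || first > hi) continue;      // node is outside the range
    if (first >= lo && last <= hi) {
      partials.push(node.reduced);              // already reduced
    } else {
      rowsScanned += node.rows.length;          // end node: reduce again
      const inRange = node.rows.filter(r => r.key >= lo && r.key <= hi);
      partials.push(sum(inRange.map(r => r.value)));
    }
  }
  // Final step: combine the partial results, like reduce(R1',R2,R3,R4').
  return { total: sum(partials), rowsScanned: rowsScanned };
}

// Four nodes of four rows each, keys 1..16, value equal to the key.
const nodes = [];
for (let n = 0; n < 4; n++) {
  const rows = [];
  for (let i = 1; i <= 4; i++) {
    rows.push({ key: n * 4 + i, value: n * 4 + i });
  }
  nodes.push(makeNode(rows));
}

const result = rangeSum(nodes, 3, 14);
```

In this example the two middle nodes are never scanned; only the eight rows in the two end nodes are looked at again.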

The downside, of course, is that you have to *plan* this in couchdb, by
building your views appropriately. A SQL database can take any arbitrary
query, and have a stab at it using whatever combination of index scans and
table reads it thinks is appropriate. But having seen SQL databases make
very bad decisions in this area, I don't consider this something to trumpet
about.

The other downside is when doing joins across multiple 'tables', which in
couchdb would be one document cross-referencing to another. You have to
build your view with multiple rows, one from each document, and combine them
in the client. This isn't particularly hard, but it does negate the reduce
functionality.
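
A sketch of that kind of client-side join, under some loud assumptions: documents carry a hypothetical `type` field, comments point at their entry through `blog_entry_id`, and the sorting is done locally with a simplified comparator that is only good enough for these keys (CouchDB's real collation rules are richer). The `emit` harness stands in for the one CouchDB provides.

```javascript
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// One row per document. The second key element makes each blog entry sort
// immediately before its own comments.
function mapPostsAndComments(doc) {
  if (doc.type === "post") {
    emit([doc._id, 0], { title: doc.title });
  } else if (doc.type === "comment") {
    emit([doc.blog_entry_id, 1], { text: doc.text });
  }
}

// Simplified element-by-element key comparison (an assumption, see above).
function compareKeys(a, b) {
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length;
}

// The "join": walk the sorted rows and attach comments to the entry that
// precedes them.
function combine(sortedRows) {
  const posts = [];
  for (const row of sortedRows) {
    const current = posts[posts.length - 1];
    if (row.key[1] === 0) {
      posts.push({ id: row.key[0], title: row.value.title, comments: [] });
    } else if (current !== undefined && current.id === row.key[0]) {
      current.comments.push(row.value.text);
    }
  }
  return posts;
}

[
  { _id: "comment1", type: "comment", blog_entry_id: "2009/02/01/entry1", text: "a" },
  { _id: "2009/02/09/entry2", type: "post", title: "Second" },
  { _id: "2009/02/01/entry1", type: "post", title: "First" },
  { _id: "comment3", type: "comment", blog_entry_id: "2009/02/09/entry2", text: "c" },
  { _id: "comment2", type: "comment", blog_entry_id: "2009/02/01/entry1", text: "b" }
].forEach(mapPostsAndComments);

rows.sort((x, y) => compareKeys(x.key, y.key));
const joined = combine(rows);
```

Because the rows of one view now come from two kinds of document, a reduce over that view would mix them together, which is why this layout gives up the reduce functionality.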

> You still need one lookup for every blog entry on a page.
> And there is no way you can ever store the comment count inside the blog
> entry.

I'm not sure what an RDBMS offers here that couchdb does not.

A simple map/reduce query will give you a count of all the comments for a
blog entry (and it will scale to millions of comments).

Sure, you can construct a SQL join which gives you the blog entry plus its
comments count in one go, but the SQL database is doing the same sort of
work behind the scenes.

If you have multiple blog entries on a page, a single couchdb group query
can give you all the comment counts in one go. If the keyspace is contiguous
(e.g. by blog posting date) then it's easy (*). And even if not, you can use
the POST multiple-fetch API to get all the comment counts for an arbitrary
set of blog entries in one request.
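
A rough sketch of what such a multi-fetch request could look like. It only builds the request and does not send it. The database name `blog`, the design document `comments` and the view `count_by_entry` are all hypothetical, the view is assumed to be keyed by the blog entry id alone, and the exact URL layout depends on the CouchDB version in use.

```javascript
// Build (but do not send) a multi-key view request. The list of keys goes
// in the POST body; group=true asks for one reduced row per key.
function buildCountRequest(entryIds) {
  return {
    method: "POST",
    path: "/blog/_design/comments/_view/count_by_entry?group=true",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ keys: entryIds })
  };
}

const request = buildCountRequest([
  "2009/02/01/entry1",
  "2009/02/09/entry2"
]);
```

The keys do not need to be contiguous, which is what makes this usable when blog entries have unrelated ids.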

But perhaps I'm missing something from your requirements.

Regards,

Brian.

(*) If each comment document has a blog_entry_id, then you can emit
something like

    key                                 value

    ["2009/02/01/entry1","comment1"]    null
    ["2009/02/01/entry1","comment2"]    null
    ["2009/02/09/entry2","comment3"]    null
    ["2009/02/09/entry2","comment4"]    null
    ["2009/02/09/entry2","comment5"]    null
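
A map function producing rows like these might look as follows. This is a sketch: it assumes that anything with a `blog_entry_id` is a comment and that the comment's own `_id` is the second key element. The `emit` harness is only there so the example runs outside CouchDB.

```javascript
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Emit one row per comment, keyed by [blog entry id, comment id].
function mapComments(doc) {
  if (doc.blog_entry_id) {
    emit([doc.blog_entry_id, doc._id], null);
  }
}

[
  { _id: "2009/02/01/entry1", title: "First" },
  { _id: "comment1", blog_entry_id: "2009/02/01/entry1", text: "a" },
  { _id: "comment2", blog_entry_id: "2009/02/01/entry1", text: "b" }
].forEach(mapComments);
```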

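Use a counter reduce function (the third argument is the rereduce flag):

    function(ks, vs, co) {
      if (co) {
        // rereduce: vs holds counts already computed for sub-ranges
        return sum(vs);
      } else {
        // first pass: vs holds one emitted value per comment
        return vs.length;
      }
    }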

For the comment counts for all blog entries this month, ask for

    group=true&group_level=1&startkey=["2009/02/01"]&endkey=["2009/03/01"]
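
To see what that query returns, here is a local simulation over the example rows. `groupedQuery` is a hypothetical stand-in for the view engine, `compareKeys` is a simplified comparator that is only adequate for these keys, and `sum` is defined here because CouchDB's built-in one is not available outside the view server.

```javascript
function sum(values) {
  return values.reduce((a, b) => a + b, 0);
}

// The counter reduce: count rows on the first pass, add up counts on rereduce.
function countReduce(keys, values, rereduce) {
  return rereduce ? sum(values) : values.length;
}

// Simplified element-by-element key comparison (an assumption).
function compareKeys(a, b) {
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length;
}

// Keep rows between startkey and endkey, group them by the first
// group_level key elements, and reduce each group.
function groupedQuery(rows, reduceFn, opts) {
  const groups = new Map();
  for (const row of rows) {
    if (compareKeys(row.key, opts.startkey) < 0) continue;
    if (compareKeys(row.key, opts.endkey) > 0) continue;
    const groupKey = row.key.slice(0, opts.group_level);
    const id = JSON.stringify(groupKey);
    if (!groups.has(id)) groups.set(id, { key: groupKey, values: [] });
    groups.get(id).values.push(row.value);
  }
  return Array.from(groups.values()).map(g => ({
    key: g.key,
    value: reduceFn(null, g.values, false)
  }));
}

const viewRows = [
  { key: ["2009/02/01/entry1", "comment1"], value: null },
  { key: ["2009/02/01/entry1", "comment2"], value: null },
  { key: ["2009/02/09/entry2", "comment3"], value: null },
  { key: ["2009/02/09/entry2", "comment4"], value: null },
  { key: ["2009/02/09/entry2", "comment5"], value: null },
  { key: ["2009/03/05/entry3", "comment6"], value: null }  // outside the range
];

const counts = groupedQuery(viewRows, countReduce, {
  group_level: 1,
  startkey: ["2009/02/01"],
  endkey: ["2009/03/01"]
});
```

The result has one row per blog entry in February, with the comment count as its value; the March entry falls outside the key range.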

Getting the text of all these blog entries would be a separate query.

I think this shows that in couchdb, there is an advantage to using a doc id
which has relevance to the application.

The SQL normalization brigade would say use a random uuid for every blog
entry and every comment. If you do, I agree that makes it a bit harder to do
this sort of aggregation. But I think the multi-fetch API should still work.

