Basic Aggregation in MongoDB 2.1 with Python

Posted by jieforest on 2012-06-10
Why a new framework?

If you've been following along with this article series, you've been introduced to MongoDB's mapreduce command, which up until MongoDB 2.1 has been the go-to aggregation tool for MongoDB. (There's also the group() command, but it's really no more than a less-capable and un-shardable version of mapreduce(), so we'll ignore it here.)

So if you already have mapreduce() in your toolbox, why would you ever want something else?

Mapreduce is hard; let's go shopping

The first motivation behind the new framework is that, while mapreduce() is a flexible and powerful abstraction for aggregation, it's really overkill in many situations, as it requires you to re-frame your problem into a form that's amenable to calculation using mapreduce().

For instance, if I want to calculate the mean value of a property across a series of documents, breaking that calculation down into appropriate map, reduce, and finalize steps imposes extra cognitive overhead that we'd like to avoid. So the new aggregation framework is (IMO) simpler.
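To see that overhead concretely, here's a plain-Python sketch of the bookkeeping mapreduce() forces on you for a simple mean. The field name "price" and the sample documents are hypothetical; the point is that you can't emit means directly, because means aren't associative, so you have to carry (sum, count) pairs all the way through.

```python
# Hypothetical documents; in MongoDB these would live in a collection.
docs = [{"price": 10}, {"price": 20}, {"price": 30}]

# map step: emit a (sum, count) pair per document. We can't emit the mean
# itself, because reduce must be able to combine partial results.
mapped = [{"sum": d["price"], "count": 1} for d in docs]

# reduce step: combine pairs. This must give the same answer whether it
# runs once over everything or repeatedly over partial results.
def reduce_pairs(values):
    return {"sum": sum(v["sum"] for v in values),
            "count": sum(v["count"] for v in values)}

# finalize step: only now can the mean actually be computed.
reduced = reduce_pairs(mapped)
mean = reduced["sum"] / reduced["count"]  # 20.0
```

With the new framework, the same result is a single pipeline stage: something along the lines of db.prices.aggregate() with a $group stage using $avg on the price field (collection name assumed here), with no sum/count plumbing to write yourself.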

The Javascript global interpreter lock is evil

The MapReduce algorithm, the basis of MongoDB's mapreduce() command, is a great approach to solving Embarrassingly Parallel problems.

Each invocation of map, reduce, and finalize is completely independent of the others (though the map/reduce/finalize phases are order-dependent), so we should be able to dispatch these jobs to run in parallel without any problems.
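That independence is easy to demonstrate in plain Python: reduce steps over disjoint chunks could run on separate workers, and combining their partial results gives the same answer as one serial pass. The (sum, count) pair encoding and the data are hypothetical, standing in for what a MongoDB reduce function would see.

```python
# A reduce that tolerates re-reduction: combining partial results must
# give the same answer as reducing everything at once.
def reduce_pairs(values):
    return {"sum": sum(v["sum"] for v in values),
            "count": sum(v["count"] for v in values)}

pairs = [{"sum": p, "count": 1} for p in (10, 20, 30, 40)]

# One serial pass over all mapped pairs.
serial = reduce_pairs(pairs)

# Two independent partial reductions (could run on separate shards or
# threads), followed by one combining reduction.
partials = [reduce_pairs(pairs[:2]), reduce_pairs(pairs[2:])]
parallel = reduce_pairs(partials)

assert serial == parallel  # both {'sum': 100, 'count': 4}
```

This is exactly the property that makes MapReduce embarrassingly parallel in principle; the next section explains why MongoDB's Javascript engine fails to cash in on it.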

Unfortunately, due to MongoDB's use of the SpiderMonkey Javascript engine, each mongod process is restricted to running a single Javascript thread at a time.

So in order to get any parallelism with a MongoDB mapreduce(), you must run it on a sharded cluster, and on a cluster with N shards, you're limited to N-way parallelism.

