Napkin math for MongoDB performance

阎祖鹤

2023-12-01

文章来源：http://rickosborne.org/blog/2010/02/napkin-math-for-mongodb-performance/

As we all know, there are lies, damned lies, and statistics. What I’m about to present shouldn’t even qualify as statistics—it’s just a bunch of damned lies. I’m not set up to do any sort of rigorous performance testing, so these should not be construed as anything but what they are: one guy’s half-assed and probably flawed measurements.

I was playing around with MapReduce on MongoDB, trying to figure out how to code the equivalent of SQL’s COUNT(DISTINCT column) functionality. The short answer is: don’t do it . Or, if you do it, figure out a better way than I did. Along the way, I gathered some metrics on what types of operations cause what kinds of performance hits.

The Setup

My set up is a database of 3,397,115 records, all of which look something like this:

{ "_id" : 3002827, "mm" : 7, "stars" : 5, "date" : "2005-07-18", "dd" : 18, "cust" : 2213, "movie" : 14889, "yy" : 2005, "title" : "Species", "year" : 1995 }

Yeah, I just took the Netflix prize data and inserted ~3M records. I did the inserts across 3 shard services, all running on the same machine, which led to 9 chunks of roughly equal size. I let MongoDB handle the sharding—I didn’t manually split the shards. I ensured one index on the collection, over movie and cust , which isn’t really used for the query in question, but I thought it was worth mentioning.

Yeah, I know performance is going to suffer because I’m running 3 shards from the same hard drive. That’s kindof the point.

I ran all of this on my MacBook Pro, which is a 2.66 GHz Core 2 Duo with 4GB of 1067 MHz DDR3. I continued to do other light-duty tasks while running the tests, but nothing that should have interfered greatly.

The Queries

Here’s the starting query’s SQL equivalent:

SELECT releaseYear, COUNT(*) AS nRecords, COUNT(DISTINCT movie) AS mMovies, COUNT(DISTINCT cust) AS cCustomers, SUM(stars) AS totalStars, AVG(stars) AS avgStars FROM training WHERE (releaseYear = 1990) GROUP BY releaseYear

And the MapReduce query itself, as I wrote it:

db.runCommand({ mapreduce: "training", query: { year: 1990 }, map: function() { var m = {}, c = {}; m[this.movie] = true; c[this.cust] = true; emit( this.year, { "stars": this.stars, "n": 1, "m": m, "c": c } )}, reduce: function(key, vals) { var stars = 0, n = 0, m = {}, c = {}; for(var i = 0; i < vals.length; i++) { var v = vals[i]; stars += v.stars; n += v.n; for (var im in v.m) m[im] = true; for (var ic in v.c) c[ic] = true; } return { "stars": stars, "n": n, "m": m, "c": c }; }, finalize: function(key, val) { val.avg = val.stars / val.n; var m = 0, c = 0; for (var im in val.m) m++; for (var ic in val.c) c++; val.m = m; val.c = c; return val; }, out: "result1", verbose: true });

Those nasty bits with the for-in loops are for the COUNT(DISTINCT column) logic. This query produces the following result set:

{ "_id" : 1990, "value" : { "stars" : 593179, "n" : 154617, "m" : 7, "c" : 120259, "avg" : 3.8364410123078314 } }

The Results

All times below are in mm:ss format. (Minutes, not hours.)

Query	Total Time	Shards Time	Final Function
1	10:44	03:46	06:58
1	This was the starting query above, as written.
2	90:48	36:26	54:22
2	I widened the release year restriction from just 1990 to 1990-1999, via `{ year: { $gte: 1990, $lte: 1999 } }` . That's close to a linear relationship between emitted records and time elapsed.
3	21:33	13:53	07:40
3	I used `movechunk` to consolidate all of the chunks on one shard server, then shut down the other two. I reduced the release year restriction back to just 1990. It takes 2x longer than the first query, presumably due to disk bottlenecks? One shard trying to reduce 9 chunks at once?
4	02:08	02:08	-
4	I removed the `for-in` loops and `COUNT(DISTINCT)` logic, leaving only the plain record count and average, but was still on the one shard server, implying a 10x slowdown for that type of logic.
Query	Total Time	Map Time	Emit Loop
5	00:13	00:06	00:13
5	I connected to the one remaining shard directly, instead of through `mongos` , and ran the previous query (no `for-in` ). Again, this implies a 10x slowdown due to trying to process chunks simultaneously.
6	05:24	00:15	01:14
6	Still connected directly to the one shard (no `mongos` ) with all of the records, I ran the original query (with `for-in` logic). A slowdown of 25x seems a little high, but I ran the query twice to verify it.

Lessons Learned

Queries scream when a single shard is left to its own devices—but when parallelism is attempted on the same shard you get a massive performance hit. Don't run different shards off the same hard drive—no matter how many cores you have.
Don't try to emulate COUNT(DISTINCT) . Really.

I have to wonder if mongos can be tweaked to serialize queries against chunks on the same shard, to prevent disk contention issues?

MongoDB's performance on aggregation queries

Is this Map Reduce performance normal or I am missing something

Napkin math for MongoDB performance

The Setup

The Queries

The Results

Lessons Learned

相关阅读

相关文章

相关问答