One of my favorite MySQL tools ever is pt-query-digest. It's the tool you use to generate a report from your slow query log (or some other supported sources), and is similar to, for example, the Query Analyzer in MySQL Enterprise Monitor. Especially in my kind of job, where I often land out of nowhere on a server at a customer site and need to quickly get an overview of what is going on, this tool is always the first one I run. I will show some examples below.
When I started engaging with MongoDB projects at Nokia, I needed some project of my own to tinker with, so I decided to add support for a MongoDB "slow query log". Back then, my first task was to figure out whether MongoDB even has such logging for slow queries. Lucky me, it does: it's known as the MongoDB Profiler.
Nokia allowed me to contribute the changes back upstream to Percona. In recent weeks I finally had some time to upgrade the code to understand the changed log format of MongoDB 2.4. (Earlier versions are no longer supported.)
And therefore, I now proudly present to you pt-query-digest with MongoDB support: https://gist.github.com/henrikingo/6065357/raw/pt-query-digest-mongo (or branch the whole bzr repo: https://code.launchpad.net/~hingo/percona-toolkit/pqd-mongodb-24)
Btw, this is the other reason I like this and the other Percona Toolkit tools: they ship with all their dependencies embedded in the same file. This means that when I'm on a customer server without root access, I can just do
wget https://gist.github.com/henrikingo/6065357/raw/pt-query-digest-mongo
...and I'm good to go! Command line tools for the win :-)
My pt-query-digest supports reading slow queries from the MongoDB log file. To get some queries logged there, you need to lower the threshold from the default of 100 ms. In fact, I like to collect as much as possible, so I set it to 1 ms. (An unfortunate limitation is that MongoDB won't let me set the limit to 0 ms.)
db.setProfilingLevel(1,1)
Then run pt-query-digest with the --type=mongolog option. Using --limit is optional and means the same as in normal pt-query-digest; see --help for other options.
pt-query-digest --type=mongolog --limit=20 < ../mongodb.log
At the top of the output you first get some summary information about all the queries in the log.
# 19.2s user time, 50ms system time, 25.96M rss, 95.67M vsz
# Current date: Tue Jul 23 14:21:53 2013
# Hostname: hingo-sputnik
# Files: STDIN
# Overall: 84.96k total, 9 unique, 0 QPS, 0x concurrency _________________
# Time range: Sat Jul 20 08:08:57 to Jul
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time          2404s     2ms    83ms    28ms    42ms    13ms    33ms
# Rows sent        665.96k       0     100    8.03   97.36   26.46       0
# Rows examine     733.62k       0     200    8.84   97.36   30.16       0
# Query size         3.33M       6     173   41.09   76.28   30.35   31.70
# Key updates         1020       0       1    0.01       0    0.11       0
# Result bytes      97.74M       0  12.91k   1.18k  12.50k   3.47k   92.72
# Rlock time           29s       0    18ms   341us     3ms   972us       0
# Scan & order       6.11k       0       1    0.07    0.99    0.26       0
# Wlock time         392ms       0     7ms     4us       0    63us       0
# Yields                 0       0       0       0       0       0       0
From the above we can already see that we are looking at a fairly well behaved system. For example, execution time for all queries is well below 100 ms (so they wouldn't even show up in the log by default), the largest result sizes are just below 13k, very little time was spent in the write lock, and so forth.
Next we get a summary breakdown per query. Now we get to the part where pt-query-digest really shines. The typical use of a slow query log would be to just look at whichever queries happen to be slowest, and then try to fix them. Of course, a query that takes tens of seconds to run may or may not be a real problem. But experience shows your energy is often better spent somewhere else completely, and pt-query-digest will tell you where to focus.
Instead, pt-query-digest de-parameterizes all of the queries, so that it can recognize queries that are actually the same, differing only in their parameters. It then sums up and sorts these query classes (by default) by total execution time. A typical application only has somewhere between 10 and 100 distinct queries, so this is the perfect tool to give you an overview of what the app is doing to your database:
# Profile
# Rank Query ID           Response time    Calls R/Call V/M   Item
# ==== ================== =============== ====== ====== ===== =============
#    1 0xA9297704DC845CFE 2354.4260 97.9%  71264 0.0330  0.00 COMMAND sbtest.$cmd
#    2 0x056DECD9CABFBFFA   18.1960  0.8%   4171 0.0044  0.00 COMMAND sbtest.$cmd
#    3 0xF6661101714FC832   17.3660  0.7%   6257 0.0028  0.00 QUERY sbtest.sbtest1
#    4 0x102C8F18966084AF    7.8110  0.3%   1383 0.0056  0.00 UPDATE sbtest.sbtest1
#    5 0x411E607C59AC3D1F    2.1300  0.1%    440 0.0048  0.01 INSERT sbtest.sbtest1
#    6 0xE097817C3B6F6EE3    1.4490  0.1%    565 0.0026  0.00 QUERY sbtest.sbtest1
#    7 0xFC580F8AC66DC809    1.1850  0.0%    442 0.0027  0.00 COMMAND sbtest.$cmd
#    8 0x094C2DDF61C1708A    1.1330  0.0%    253 0.0045  0.01 REMOVE sbtest.sbtest1
#    9 0x9696A6AE83C26BF7    0.5080  0.0%    188 0.0027  0.00 QUERY sbtest.sbtest1
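As a side note, the de-parameterization is essentially literal values being replaced with placeholders, so that structurally identical queries collapse into one class. Here is a minimal Python sketch of the idea (pt-query-digest itself is written in Perl and uses a far more thorough set of rules; the patterns below are simplified for illustration only):

```python
import re

def fingerprint(query):
    """Collapse literal values so that structurally identical queries
    produce the same fingerprint. A simplified illustration of what
    pt-query-digest does with regular expressions."""
    # Replace numeric literals with a placeholder
    q = re.sub(r'\b\d+(\.\d+)?\b', '?', query)
    # Replace quoted string literals with a placeholder
    q = re.sub(r'"[^"]*"', '?', q)
    # Normalize whitespace
    return re.sub(r'\s+', ' ', q).strip()

a = fingerprint('query sbtest.sbtest1 { $query: { _id: { $gte: 7957, $lte: 8056 } }, $orderby: { c: 1 } }')
b = fingerprint('query sbtest.sbtest1 { $query: { _id: { $gte: 100, $lte: 199 } }, $orderby: { c: 1 } }')
print(a == b)  # → True: the two queries share one fingerprint
```

With the parameters stripped out, execution times can be summed per fingerprint rather than per individual query.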
Now, if this were a MySQL-based application, I would be very worried about the above results. There is one query taking up 98% of the total execution time of everything that happens in the database! It turns out the benchmark I used to generate some "noise" for this example was set to use WriteConcern w=1, so the top line is simply wait time resulting from getLastError calls:
# Query 1: 0 QPS, 0x concurrency, ID 0xA9297704DC845CFE at byte 8790360 __
# Scores: V/M = 0.00
# Time range: Sat Jul 20 08:08:57 to Jul
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         83   71264
# Exec time     97   2354s     2ms    83ms    33ms    44ms     8ms    34ms
# Rows sent      0       0       0       0       0       0       0       0
# Rows examine   0       0       0       0       0       0       0       0
# Query size    65   2.17M      32      32      32      32       0      32
# Key updates    0       0       0       0       0       0       0       0
# Result bytes   6   5.98M      79     169   88.02   92.72    8.26   92.72
# Rlock time     0       0       0       0       0       0       0       0
# Scan & order   0       0       0       0       0       0       0       0
# Wlock time     0       0       0       0       0       0       0       0
# Yields         0       0       0       0       0       0       0       0
# String:
# Databases    sbtest.$cmd
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms  #
#  10ms  ################################################################
# 100ms
#   1s
#  10s+
command sbtest.$cmd { getlasterror: 1, fsync: true }
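As a quick aside on reading these reports, the columns tie together arithmetically: the total Exec time is essentially Count times the average per-call time. A sanity check on the getLastError class above:

```python
# Numbers taken from the query 1 report above
calls = 71264        # Count: number of getLastError calls
avg_exec = 0.033     # avg Exec time in seconds (33ms)

total = calls * avg_exec
print(round(total))  # → 2352, matching the ~2354s total Exec time
```

The small difference is just rounding in the reported 33ms average.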
If your application is doing this, it might not be what you want. As you can see, your clients end up waiting a lot compared to the time it actually takes to execute the real queries. But for the MongoDB server this is actually harmless; it's not executing or blocking anything, it's just you who is waiting all the time.
Ok, let's fix that and run pt-query-digest again. We get a different distribution that's more interesting to look at:
# Profile
# Rank Query ID           Response time  Calls R/Call V/M   Item
# ==== ================== ============= ====== ====== ===== ===============
#    1 0x056DECD9CABFBFFA 18.1960 36.6%   4171 0.0044  0.00 COMMAND sbtest.$cmd
#    2 0xF6661101714FC832 17.3660 34.9%   6257 0.0028  0.00 QUERY sbtest.sbtest1
#    3 0x102C8F18966084AF  7.8110 15.7%   1383 0.0056  0.00 UPDATE sbtest.sbtest1
#    4 0x411E607C59AC3D1F  2.1300  4.3%    440 0.0048  0.01 INSERT sbtest.sbtest1
#    5 0xE097817C3B6F6EE3  1.4490  2.9%    565 0.0026  0.00 QUERY sbtest.sbtest1
#    6 0xFC580F8AC66DC809  1.1850  2.4%    442 0.0027  0.00 COMMAND sbtest.$cmd
#    7 0x094C2DDF61C1708A  1.1330  2.3%    253 0.0045  0.01 REMOVE sbtest.sbtest1
#    8 0x9696A6AE83C26BF7  0.5080  1.0%    188 0.0027  0.00 QUERY sbtest.sbtest1
So it seems we have two queries that together consume over 70% of the total execution time spent in the database, and the top 3 queries use up about 87% of your MongoDB capacity! This means it makes sense to focus solely on optimizing these top 2-3 queries and ignore everything else.
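These concentrations can be recomputed directly from the Response time column of the profile above:

```python
# Response time column from the profile above, in seconds, sorted by rank
response_times = [18.196, 17.366, 7.811, 2.130, 1.449, 1.185, 1.133, 0.508]

total = sum(response_times)
top2 = sum(response_times[:2]) / total * 100
top3 = sum(response_times[:3]) / total * 100
print(f"top 2: {top2:.1f}%, top 3: {top3:.1f}%")  # → top 2: 71.4%, top 3: 87.1%
```

This long-tail pattern, where a handful of query classes dominate the total load, is what makes the profile so useful for prioritizing optimization work.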
This is what they look like in more detail:
# Query 1: 0 QPS, 0x concurrency, ID 0x056DECD9CABFBFFA at byte 389327 ___
# Scores: V/M = 0.00
# Time range: Sat Jul 20 08:11:09 to Jul
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         30    4171
# Exec time     36     18s     2ms    46ms     4ms    11ms     4ms     3ms
# Rows sent      0       0       0       0       0       0       0       0
# Rows examine   0       0       0       0       0       0       0       0
# Query size    58 695.74k     167     173  170.81  166.51       0  166.51
# Key updates    0       0       0       0       0       0       0       0
# Result bytes   0 309.57k      76      76      76      76       0      76
# Rlock time    21      6s   357us     7ms     1ms     3ms   930us     1ms
# Scan & order   0       0       0       0       0       0       0       0
# Wlock time     0       0       0       0       0       0       0       0
# Yields         0       0       0       0       0       0       0       0
# String:
# Databases    sbtest.$cmd
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms  ################################################################
#  10ms  #####
# 100ms
#   1s
#  10s+
command sbtest.$cmd { aggregate: "sbtest1", pipeline: [ { $match: { _id: { $gte: 8184, $lte: 8283 } } }, { $project: { k: 1, _id: 0 } }, { $group: { _id: null, average: { $sum: "$k" } } } ] }

# Query 2: 0 QPS, 0x concurrency, ID 0xF6661101714FC832 at byte 2281363 __
# Scores: V/M = 0.00
# Time range: Sat Jul 20 08:09:10 to Jul
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         45    6257
# Exec time     34     17s     2ms    61ms     3ms     5ms     2ms     2ms
# Rows sent     91 610.80k      97     100   99.96   97.36    0.06   97.36
# Rows examine  92 677.46k      97     200  110.87  192.76   29.75   97.36
# Query size    34 408.30k      63      68   66.82   65.89    0.88   65.89
# Key updates    0       0       0       0       0       0       0       0
# Result bytes  85  78.85M  12.52k  12.91k  12.91k  12.50k    0.00  12.50k
# Rlock time    66     19s     2ms    18ms     3ms     5ms     1ms     3ms
# Scan & order 100   6.11k       1       1       1       1       0       1
# Wlock time     0       0       0       0       0       0       0       0
# Yields         0       0       0       0       0       0       0       0
# String:
# Databases    sbtest.sbtest1
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms  ################################################################
#  10ms  #
# 100ms
#   1s
#  10s+
query sbtest.sbtest1 { $query: { _id: { $gte: 7957, $lte: 8056 } }, $orderby: { c: 1 } }

# Query 3: 0 QPS, 0x concurrency, ID 0x102C8F18966084AF at byte 1229838 __
# Scores: V/M = 0.00
# Time range: Sat Jul 20 08:11:13 to Jul
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         10    1383
# Exec time     15      8s     2ms    44ms     6ms    13ms     4ms     4ms
# Rows sent      0       0       0       0       0       0       0       0
# Rows examine   0    1020       0       1    0.74    0.99    0.44    0.99
# Query size     1  17.41k      10      14   12.89   12.54    0.24   12.54
# Key updates  100    1020       0       1    0.74    0.99    0.44    0.99
# Result bytes   0       0       0       0       0       0       0       0
# Rlock time     0       0       0       0       0       0       0       0
# Scan & order   0       0       0       0       0       0       0       0
# Wlock time    57   226ms    19us     7ms   163us   301us   319us   119us
# Yields         0       0       0       0       0       0       0       0
# String:
# Databases    sbtest.sbtest1
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms  ################################################################
#  10ms  ##########
# 100ms
#   1s
#  10s+
update sbtest.sbtest1 { _id: 4482 }
As you can see, a lot of information is presented for each query. And thanks to the wonder of open source and a few regular expressions, I can now benefit from this reporting tool in my work with MongoDB too.
If you want to read up on what each of the columns means, consult the Percona Toolkit documentation.
I used Tim Callaghan's Sysbench for MongoDB to generate some queries for this demo.