ch15,Delivering a SQL Server Health Check--PART 3

申查猛

2023-12-01

一、前言

做为DBA，在接手一台全新的或是不熟悉的SQL Server DB服务器时，首先需要做的，就是尽可能详尽的收集这台服务器的相关信息，这包括硬件信息和存储子系统以及OS相关信息，SQL server instance本身的信息等。做为数据库专业人士，尤其是对于一个专业的DBA，这是必须知道的基本信息。

对于可以接直接登录到DB server OS的DBA，则可以通过登录到OS直接收集相关信或使用DMV来收集，而对于不能或是不被允许直接登录到OS的DBA，则只有通过使用DMV/DMF了。

二，使用DMV/DMF收集DB SERVER相关信息

A，获取当前实例自最近一次启动后累积的等待信息

注：自最近一次启动后累积地说法稍有不妥，因为DMV/DMF的

信息是存在内存中的，如果SQL instance有内存压力，那么存放在

DMV/DMF中的信息由于内存压力的缘故可能被清除掉，这是需要注意

的地方。

-- Isolate top waits for server instance since last restart or statistics clear

WITH Waits AS

(SELECT wait_type, wait_time_ms / 1000. AS wait_time_s,

100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct,

ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn

FROM sys.dm_os_wait_stats WITH (NOLOCK)

WHERE wait_type NOT IN (N'CLR_SEMAPHORE',N'LAZYWRITER_SLEEP',N'RESOURCE_QUEUE',

N'SLEEP_TASK',N'SLEEP_SYSTEMTASK',N'SQLTRACE_BUFFER_FLUSH',N'WAITFOR',

N'LOGMGR_QUEUE',N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH',

N'XE_TIMER_EVENT',N'BROKER_TO_FLUSH',N'BROKER_TASK_STOP',N'CLR_MANUAL_EVENT',

N'CLR_AUTO_EVENT',N'DISPATCHER_QUEUE_SEMAPHORE', N'FT_IFTS_SCHEDULER_IDLE_WAIT',

N'XE_DISPATCHER_WAIT', N'XE_DISPATCHER_JOIN', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',

N'ONDEMAND_TASK_QUEUE', N'BROKER_EVENTHANDLER', N'SLEEP_BPOOL_FLUSH',

N'DIRTY_PAGE_POLL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',

N'SP_SERVER_DIAGNOSTICS_SLEEP'))

SELECT W1.wait_type,

CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s,

CAST(W1.pct AS DECIMAL(12, 2)) AS pct,

CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct

FROM Waits AS W1

INNER JOIN Waits AS W2

ON W2.rn <= W1.rn

GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct

HAVING SUM(W2.pct) - W1.pct < 99 OPTION (RECOMPILE); -- percentage threshold

B，获取实例等待CPU的信息

-- Signal Waits for instance

--Signal waits are CPU-related waits

-- Signal Waits above 15-20% is usually a sign of CPU pressure

--Signal Waits表示等待CPU的时间，超过总等待时间的15~20%,表示有CPU压力

SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM (wait_time_ms) AS NUMERIC(20,2))

AS [%signal (cpu) waits],

CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms) / SUM (wait_time_ms) AS NUMERIC(20,2)) AS [%resource waits]

FROM sys.dm_os_wait_stats WITH (NOLOCK) OPTION (RECOMPILE);

C，Login count information

-- Get logins that are connected and how many sessions they have

-- This can help characterize your workload and

-- determine whether you are seeing a normal level of activity

SELECT login_name, COUNT(session_id) AS [session_count]

FROM sys.dm_exec_sessions WITH (NOLOCK)

GROUP BY login_name

ORDER BY COUNT(session_id) DESC OPTION (RECOMPILE);

D，获得平均任务数

--Average Task Count information

--运行多次，如果返回的值持续超过10，则需要进一步调查

-- High Avg Task Counts are often caused by blocking or other resource contention

-- High Avg Runnable Task Counts are a good sign of CPU pressure，等待CPU的平均task数

-- High Avg Pending DiskIO Counts are a sign of disk pressure，等待IO的平均task数

-- Get Average Task Counts (run multiple times)

SELECT AVG(current_tasks_count) AS [Avg Task Count],

AVG(runnable_tasks_count) AS [Avg Runnable Task Count],

AVG(pending_disk_io_count) AS [Avg Pending DiskIO Count]

FROM sys.dm_os_schedulers WITH (NOLOCK)

WHERE scheduler_id < 255 OPTION (RECOMPILE);

E,获取CPU利用率的历史数据

本查询返回当前实例在最近的256分钟之内的CPU使用率

每分钟取样一次，主要关注CPU的使用率趋势：

--CPU utilization history

-- Look at the trend over the entire period.

-- Also look at high sustained Other Process CPU Utilization values

-- Get CPU Utilization History for last 256 minutes (in one minute intervals)

-- This version works with SQL Server 2008 and above

DECLARE @ts_now bigint = (SELECT cpu_ticks/(cpu_ticks/ms_ticks)

FROM sys.dm_os_sys_info WITH (NOLOCK));

SELECT TOP(256) SQLProcessUtilization AS [SQL Server Process CPU Utilization],

SystemIdle AS [System Idle Process],

100 - SystemIdle - SQLProcessUtilization

AS [Other Process CPU Utilization],

DATEADD(ms, -1 * (@ts_now - [timestamp]),

GETDATE()) AS [Event Time]

FROM (SELECT record.value('(./Record/@id)[1]', 'int') AS record_id,

record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')

AS[SystemIdle],record.value('(./Record/SchedulerMonitorEvent/SystemHealth/

ProcessUtilization)[1]','int')

AS [SQLProcessUtilization], [timestamp]

FROM (SELECT [timestamp], CONVERT(xml, record) AS [record]

FROM sys.dm_os_ring_buffers WITH (NOLOCK)

WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'

AND record LIKE N'%%') AS x

) AS y

ORDER BY record_id DESC OPTION (RECOMPILE);

F，获取OS memory信息

-- Good basic information about OS memory amounts and state

-- You want to see "Available physical memory is high"

-- This indicates that you are not under external memory pressure

SELECT total_physical_memory_kb, available_physical_memory_kb,

total_page_file_kb, available_page_file_kb,

system_memory_state_desc

FROM sys.dm_os_sys_memory WITH (NOLOCK) OPTION (RECOMPILE);

G，获取SQL SERVER 内存信息

--SQL server memory information

-- You want to see 0 for process_physical_memory_low

-- You want to see 0 for process_virtual_memory_low

-- This indicates that you are not under internal memory pressure

-- SQL Server Process Address space info

--(shows whether locked pages is enabled, among other things)

SELECT physical_memory_in_use_kb,locked_page_allocations_kb,

page_fault_count, memory_utilization_percentage,

available_commit_limit_kb, process_physical_memory_low,

process_virtual_memory_low

FROM sys.dm_os_process_memory WITH (NOLOCK) OPTION (RECOMPILE);

H，Page Life Expectancy information

-- Page Life Expectancy (PLE) value for default instance

-- PLE is one way to measure memory pressure.

-- Higher PLE is better. Watch the trend, not the absolute value.

--PLE is a measurement of how long SQL Server expects to keep in the SQL Server

--buffer pool before it is fl ushed or evicted. Higher PLE values are better than lower PLE values

--业界公认，PLE<300，系统具有性能问题，通常认为PLE越大越好，虽然有人认为，PLE大未必就表示系统性能

很好，观察此值时，不要只看某个时间点的绝对值，更重要的观察随着时间的推移，此值的变化趋势

SELECT cntr_value AS [Page Life Expectancy（seconds）]

FROM sys.dm_os_performance_counters WITH (NOLOCK)

WHERE [object_name] LIKE N'%Buffer Manager%' -- Handles named instances

AND counter_name = N'Page life expectancy' OPTION (RECOMPILE);

I，获取Memory Grants Outstanding 信息

-- Memory Grants Outstanding value for default instance

-- Memory Grants Outstanding above zero

-- for a sustained period is a secondary indicator of memory pressure

--查询当前sql server instance Memory Grants Outstanding值，此值若持续大于0，

--则侧面证明可能存在内存压力。此值代表在SQL server进程中成功获取到workspace memory

的进程总数，理想值应为0，因为workspace memory代表查询在做sorting和hashing

SELECT cntr_value AS [Memory Grants Outstanding]

FROM sys.dm_os_performance_counters WITH (NOLOCK)

WHERE [object_name] LIKE N'%Memory Manager%' -- Handles named instances

AND counter_name = N'Memory Grants Outstanding' OPTION (RECOMPILE);

J，获取Memory Grants Pending

-- Memory Grants Pending value for default instance

-- Memory Grants Pending above zero

-- for a sustained period is an extremely strong indicator of memory pressure

--[Memory Grants Pending]必须始终为0，否则表示系统具有严重的内存压力

SELECT cntr_value AS [Memory Grants Pending]

FROM sys.dm_os_performance_counters WITH (NOLOCK)

WHERE [object_name] LIKE N'%Memory Manager%' -- Handles named instances

AND counter_name = N'Memory Grants Pending' OPTION (RECOMPILE);

Memory Grants Pending是等待成功获取workspace memory grant的SQL server进程总数，

下面这段文字对这个值做了充分的解释与说明

Memory Grants Pending is the total number of processes within SQL Server

that are waiting for a workspace memory grant. You want this value to

be zero if at all possible. Any sustained value above zero is an extremely

strong indicator of memory pressure. Especially if you see any signs of

internal memory pressure from the previous three queries, take a closer

look at the overall memory usage in SQL Server by running the query shown

in the following code block:

K,获取Memory clerk信息

-- Memory Clerk Usage for instance

-- Look for high value for CACHESTORE_SQLCP (Ad-hoc query plans)

-- CACHESTORE_SQLCP SQL Plans

-- These are cached SQL statements or batches that

-- aren't in stored procedures, functions and triggers

-- CACHESTORE_OBJCP Object Plans

-- These are compiled plans for

-- stored procedures, functions and triggers

-- CACHESTORE_PHDR Algebrizer Trees

-- An algebrizer tree is the parsed SQL text

-- that resolves the table and column names

SELECT TOP(10) [type] AS [Memory Clerk Type],

SUM(pages_kb) AS [SPA Mem, Kb]

FROM sys.dm_os_memory_clerks WITH (NOLOCK)

GROUP BY [type]

ORDER BY SUM(pages_kb) DESC OPTION (RECOMPILE);

L，查询只使用一次的ad-hoc查询语句

-- Find single-use, ad-hoc queries that are bloating the plan cache

-- Gives you the text and size of single-use ad-hoc queries that

-- waste space in the plan cache

-- Enabling 'optimize for ad hoc workloads' for the instance

-- can help (SQL Server 2008 and above only)

-- Enabling forced parameterization for the database can help, but test first!

SELECT TOP(20) [text] AS [QueryText], cp.size_in_bytes

FROM sys.dm_exec_cached_plans AS cp WITH (NOLOCK)

CROSS APPLY sys.dm_exec_sql_text(plan_handle)

WHERE cp.cacheobjtype = N'Compiled Plan'

AND cp.objtype = N'Adhoc'

AND cp.usecounts = 1

ORDER BY cp.size_in_bytes DESC OPTION (RECOMPILE);

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/101629/viewspace-761533/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/101629/viewspace-761533/

ch15,Delivering a SQL Server Health Check--PART 3

相关阅读

相关文章

相关问答