sql server 會記錄IO等待時間超過15 seconds的請求,這時application會有 time out 現象,dba需要判斷是workload,concurrecy 所致還是sql server配置沒有最優導致?二者之間的比例各佔多少?
1:判斷sql server的配置是否最優化,相對容易,可以有以下專案檢查:
a: 通過datafile/logfile放置在不同的drive上,可以分離random io 和 sequence io 操作,可以極大的緩解讀寫IO,對於寫IO,只需立即做logfile 的 sequence io即可,至於何時把data page 真正寫入到datafile,
sql server自己把握。根據我自己的觀察,分離datafile/logfile,一個月有個2到3次的 IO stall, 如果都在同一個磁碟,有個8到9次 IO stall 也很常見。
b: 儘量合理的設定[max server memory (MB)], 一般為總記憶體的7/8即可。
c: tempdb最好在單獨的磁碟上,如果不行的話,也要做datafile/logfile分離。
2:如果出現了問題,那如何判斷是workload導致的呢?第一要做好IO activity的統計工作,第二要通知開發,是否是業務的變化導致IO增多,困難是這個IO stall 到底涉及到哪些表?如何得到?
a: 通過以下的sp, 利用sys.dm_io_virtual_file_stats DMV 記錄每一分鐘sql server IO讀寫操作
USE [master] GO SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO create proc [dbo].[usp_monitor_virtual_file_latency] as begin if OBJECT_ID('master.dbo.tbl_virtual_file_latency','U') is null begin create table master.dbo.tbl_virtual_file_latency ( id int identity(1,1) primary key, collect_dt datetime not null, [database_name] [nvarchar](128) NULL, [file_id] [smallint] NOT NULL, [type_desc] [nvarchar](60) NULL, [sample_ms] [int] NOT NULL, [num_of_reads] [bigint] NOT NULL, [num_of_bytes_read] [bigint] NOT NULL, [io_stall_read_ms] [bigint] NOT NULL, [num_of_writes] [bigint] NOT NULL, [num_of_bytes_written] [bigint] NOT NULL, [io_stall_write_ms] [bigint] NOT NULL, [io_stall] [bigint] NOT NULL, [size_on_disk_bytes] [bigint] NOT NULL, [file_handle] [varbinary](8) NOT NULL, [physical_name] [nvarchar](260) NOT NULL, [state_desc] [nvarchar](60) NULL ) create index idx_collect_dt on tbl_virtual_file_latency(collect_dt) with(online=on) end insert into master.dbo.tbl_virtual_file_latency SELECT --virtual file latency GETDATE(), db_name(vfs.[database_id]) as database_name ,vfs.[file_id] ,mf.[type_desc] ,vfs.[sample_ms] ,vfs.[num_of_reads] ,vfs.[num_of_bytes_read] ,vfs.[io_stall_read_ms] ,vfs.[num_of_writes] ,vfs.[num_of_bytes_written] ,vfs.[io_stall_write_ms] ,vfs.[io_stall] ,vfs.[size_on_disk_bytes] ,vfs.[file_handle] ,mf.[physical_name] ,mf.[state_desc] FROM sys.dm_io_virtual_file_stats (NULL,NULL) AS [vfs] JOIN sys.master_files(nolock) AS [mf] ON [vfs].[database_id] = [mf].[database_id] AND [vfs].[file_id] = [mf].[file_id] WHERE DB_NAME(vfs.database_id) not in('master','model','msdb') --delete old data delete from master.dbo.tbl_virtual_file_latency where collect_dt <DATEADD(MONTH,-2,getdate()) end
b: 通過以下的sp,求每分鐘內每個資料檔案的IO的size和讀寫情況
USE [master] GO SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO create proc [dbo].[usp_calculate_virtual_file_stats_PerMinutes] @physical_name nvarchar(1000), @collect_starttime datetime, @collect_endtime datetime as with snap as ( select row_number() over (order by id asc) as row_num, id ,[collect_dt] ,[database_name] ,[file_id] ,[type_desc] ,[sample_ms] ,[num_of_reads] ,[num_of_bytes_read] ,[io_stall_read_ms] ,[num_of_writes] ,[num_of_bytes_written] ,[io_stall_write_ms] ,[io_stall] from master.dbo.tbl_virtual_file_latency(nolock) AS [vfs1] WHERE vfs1.physical_name=@physical_name and vfs1.collect_dt>=@collect_starttime and vfs1.collect_dt<=@collect_endtime ) select cur.collect_dt, read_io_num= case when (cur.num_of_reads-pre.num_of_reads)=0 then 0 else cur.num_of_reads-pre.num_of_reads end , write_io_num= case when (cur.num_of_writes-pre.num_of_writes)=0 then 0 else cur.num_of_writes-pre.num_of_writes end , read_latency_MS =case when (cur.num_of_reads-pre.num_of_reads)=0 then 0 else cast((cur.io_stall_read_ms-pre.io_stall_read_ms)*1./(cur.num_of_reads-pre.num_of_reads) as decimal(10,1)) end , write_latency_MS= case when (cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.io_stall_write_ms-pre.io_stall_write_ms)*1./(cur.num_of_writes-pre.num_of_writes) as decimal(10,1)) end , average_latency_MS=case when (cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.io_stall-pre.io_stall)*1./(cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes) as decimal(10,1)) end, averageKB_read=case when (cur.num_of_reads-pre.num_of_reads)=0 then 0 else cast((cur.num_of_bytes_read-pre.num_of_bytes_read)*1./(cur.num_of_reads-pre.num_of_reads)/1024 as decimal(10,1))end , averageKB_write=case when (cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.num_of_bytes_written-pre.num_of_bytes_written)*1./(cur.num_of_writes-pre.num_of_writes)/1024 as decimal(10,1)) end, averageKB_Transfer=case when (cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes)=0 then 0 else cast((cur.num_of_bytes_read-pre.num_of_bytes_read+cur.num_of_bytes_written-pre.num_of_bytes_written)*1./(cur.num_of_reads-pre.num_of_reads+cur.num_of_writes-pre.num_of_writes)/1024 as decimal(10,1)) end from snap as pre join snap as cur on pre.row_num=cur.row_num-1
c: 最後根據歷史資料來看IO讀寫的趨勢
exec usp_calculate_virtual_file_stats_PerMinutes 'E:\TestDB\test.mdf', '2014-12-14 13:49:01.000', '2014-12-14 15:49:01.000'
3:根據IO stall的資訊找出sql server 讀取哪個表導致了IO Stall.
sql server log text:
SQL Server has encountered 727 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [F:\MSSQL\Test_1.ndf:MSSQL_DBCC7] in database [Test] (7). The OS file handle is 0x0000000000001A40. The offset of the latest long I/O is: 0x000023101f0000
a: 把偏移量0x000023101f0000轉化為十進位制為:150594322432
b: 把該值除以8192即為page number: 150594322432/8192=18383096
c:在通過page number求的table name, 再與開發討論IO的問題.
select object_name(object_id) as tblname,* from sys.partitions (nolock)as sp where sp.partition_id in ( select au.container_id from sys.dm_os_buffer_descriptors(nolock) as obd join sys.allocation_units(nolock) as au on obd.allocation_unit_id=au.allocation_unit_id where obd.page_id=18383096)
d: 如果該page number不再記憶體中的化,該SQL 是沒有結果的。可以使用dbcc page(dbid,fileid,pageid,3)來得到該表的object_id.
結論:IO stall 只可能緩解,是沒法避免的,業務量導致的IO永遠在某一點上是超過儲存的容量的,dba能通過各種事前事後方法到找到root cause, 據此提出解決的方法。希望以上的方法能對大家有所啟發。