Using block_dump to see which files Linux I/O is writing to

Posted by me_lawrence on 2015-09-21

http://www.oenhan.com/block-dump-linux-io

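When debugging a program on Linux, developers often need to see exactly what I/O it is doing. The usual observation tools are vmstat and iostat. Of the two, iostat is the more capable, with finer-grained statistics per sampling interval. Both, however, only show I/O from a global point of view, and aggregate numbers are no help in working out which file the I/O is hitting. This is where block_dump comes in.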

I. Usage:

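First stop the syslog service. block_dump emits its data through printk, so if syslog is running it writes every message to the messages log, which generates a lot of I/O of its own and skews the results.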

suse:~ # service syslog stop
Shutting down syslog services done

Then enable block_dump:

suse:~ # echo 1 > /proc/sys/vm/block_dump
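
The enable step can also be wrapped so that the switch is restored afterwards. The following is only a minimal sketch: the helper name block_dump_enabled is hypothetical, the default path is the one used in the command above, and it assumes the script runs as root on a kernel that still provides this file.

```python
from contextlib import contextmanager
from pathlib import Path

# Path taken from the echo command above; it exists only on kernels
# that provide the block_dump switch.
BLOCK_DUMP = Path("/proc/sys/vm/block_dump")


@contextmanager
def block_dump_enabled(path=BLOCK_DUMP):
    """Write 1 to the switch, then restore the previous value on exit.

    The name and the 'path' parameter are illustrative assumptions;
    'path' exists so the helper can be tried against an ordinary file.
    """
    path = Path(path)
    previous = path.read_text().strip() or "0"
    path.write_text("1\n")
    try:
        yield
    finally:
        # Always put the old value back, even if the body raised.
        path.write_text(previous + "\n")
```

Running the workload inside a `with block_dump_enabled():` block keeps the logging window short, which matters because every logged request adds to the kernel log.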

Here is the result first:

suse:~ # dmesg | tail
dmesg(3414): dirtied inode 9594 (LC_MONETARY) on sda1
dmesg(3414): dirtied inode 9238 (LC_COLLATE) on sda1
dmesg(3414): dirtied inode 9241 (LC_TIME) on sda1
dmesg(3414): dirtied inode 9606 (LC_NUMERIC) on sda1
dmesg(3414): dirtied inode 9350 (LC_CTYPE) on sda1
kjournald(506): WRITE block 3683672 on sda1
kjournald(506): WRITE block 3683680 on sda1
kjournald(506): WRITE block 3683688 on sda1
kjournald(506): WRITE block 3683696 on sda1
kjournald(506): WRITE block 3683704 on sda1
kjournald(506): WRITE block 3683712 on sda1
kjournald(506): WRITE block 3683720 on sda1
kjournald(506): WRITE block 3683728 on sda1
kjournald(506): WRITE block 3683736 on sda1
kjournald(506): WRITE block 3683744 on sda1

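The dmesg output shows which files are being written, along with the process name and PID, the inode number, the file name, and the disk device name. What the dirtied inode lines do not show is how much was written to each file. For that you have to analyze the WRITE block lines. The number after WRITE block is not a real block number but the sector number seen by the kernel's I/O layer. Dividing it by 8 gives the filesystem block number (this assumes 4096-byte filesystem blocks and 512-byte sectors). The icheck and ncheck commands of the debugfs tool can then tell you which file a filesystem block belongs to: icheck maps a block to an inode, and ncheck maps an inode to a path name. Search online for the details.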
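
To get a quick per-process overview of such a log, the two message shapes can be parsed with regular expressions. This is a rough sketch rather than a finished tool: the function name summarize and the patterns are assumptions derived from the sample output above, with an optional leading dmesg timestamp and an optional "(N sectors)" suffix allowed for.

```python
import re
from collections import Counter

# Optional "[  12.345678] " timestamp that some dmesg configurations print.
TS = r"^(?:\[\s*\d+\.\d+\]\s*)?"

# "comm(pid): WRITE block N on dev", optionally followed by "(N sectors)".
BIO_RE = re.compile(
    TS + r"(?P<comm>.+)\((?P<pid>\d+)\): (?P<op>READ|WRITE) block (?P<sector>\d+)"
    r" on (?P<dev>\S+)(?: \((?P<count>\d+) sectors\))?$"
)
# "comm(pid): dirtied inode N (name) on dev"
INODE_RE = re.compile(
    TS + r"(?P<comm>.+)\((?P<pid>\d+)\): dirtied inode (?P<ino>\d+)"
    r" \((?P<name>.*)\) on (?P<dev>\S+)$"
)


def summarize(lines):
    """Return (WRITE request counts, dirtied inodes), keyed by (comm, pid, dev)."""
    writes = Counter()
    dirtied = {}
    for raw in lines:
        line = raw.strip()
        m = BIO_RE.match(line)
        if m:
            if m["op"] == "WRITE":
                writes[(m["comm"], int(m["pid"]), m["dev"])] += 1
            continue
        m = INODE_RE.match(line)
        if m:
            key = (m["comm"], int(m["pid"]), m["dev"])
            dirtied.setdefault(key, []).append((int(m["ino"]), m["name"]))
    return writes, dirtied


sample = [
    "dmesg(3414): dirtied inode 9594 (LC_MONETARY) on sda1",
    "dmesg(3414): dirtied inode 9238 (LC_COLLATE) on sda1",
    "kjournald(506): WRITE block 3683672 on sda1",
    "kjournald(506): WRITE block 3683680 on sda1",
]
writes, dirtied = summarize(sample)
print(writes)
print(dirtied)
```

The counts are numbers of WRITE requests, not bytes. When the log lines carry the "(N sectors)" suffix, the count group could be summed instead to estimate volume.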

II. How it works:

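The principle behind block_dump is simple. When the block_dump flag is set, the kernel's I/O layer stops every BIO at the checkpoint where I/O is submitted to the disk and prints its details: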

void submit_bio(int rw, struct bio *bio)
{
    int count = bio_sectors(bio);

    bio->bi_rw |= rw;

    /*
     * If it's a regular read/write or a barrier with data attached,
     * go through the normal accounting stuff before submission.
     */
    if (bio_has_data(bio) && !(rw & REQ_DISCARD)) {
        if (rw & WRITE) {
            count_vm_events(PGPGOUT, count);
        } else {
            task_io_account_read(bio->bi_size);
            count_vm_events(PGPGIN, count);
        }

        /* block_dump hook: log who submitted which sector range. */
        if (unlikely(block_dump)) {
            char b[BDEVNAME_SIZE];
            printk(KERN_DEBUG "%s(%d): %s block %Lu on %s (%u sectors)\n",
                   current->comm, task_pid_nr(current),
                   (rw & WRITE) ? "WRITE" : "READ",
                   (unsigned long long)bio->bi_sector,
                   bdevname(bio->bi_bdev, b),
                   count);
        }
    }

    generic_make_request(bio);
}

The mapping between the number printed after WRITE block and the filesystem block number is set in the submit_bh function:

bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
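
As a worked example of this relation: with a 4096-byte block, b_size >> 9 is 8, so sector 3683672 from the log corresponds to filesystem block 460459. The sketch below inverts the formula and builds a single debugfs icheck request for several blocks. The function names are hypothetical, and the 4096-byte default is an assumption; the real block size should be checked first, for example with tune2fs -l.

```python
def block_to_sector(blocknr, block_size=4096):
    """Mirror of the kernel expression: b_blocknr * (b_size >> 9)."""
    return blocknr * (block_size >> 9)


def sector_to_block(sector, block_size=4096):
    """Inverse mapping: the filesystem block that contains this sector."""
    return sector // (block_size >> 9)


def icheck_command(sectors, device, block_size=4096):
    """Build one debugfs invocation that runs icheck on all blocks at once.

    Returned as an argument list so it can be passed to subprocess.run
    without shell quoting; nothing is executed here.
    """
    blocks = sorted({sector_to_block(s, block_size) for s in sectors})
    request = "icheck " + " ".join(str(b) for b in blocks)
    return ["debugfs", "-R", request, device]


print(sector_to_block(3683672))
print(icheck_command([3683680, 3683672], "/dev/sda1"))
```

Passing many blocks in one request avoids starting debugfs once per block. The inode numbers that icheck reports can then be fed to ncheck in the same way to obtain path names.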

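For inodes, block_dump is implemented by block_dump___mark_inode_dirty. This time the checkpoint sits on the path where an inode is marked dirty ahead of writeback, and the details of every inode that passes through are printed: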

void __mark_inode_dirty(struct inode *inode, int flags)
{
    /* Excerpt: only the block_dump hook is shown here. */
    if (unlikely(block_dump))
        block_dump___mark_inode_dirty(inode);
}

static noinline void block_dump___mark_inode_dirty(struct inode *inode)
{
    /* Report everything except inode 0 of the "bdev" pseudo filesystem. */
    if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
        struct dentry *dentry;
        const char *name = "?";

        dentry = d_find_alias(inode);
        if (dentry) {
            /* Hold d_lock while the name is being printed. */
            spin_lock(&dentry->d_lock);
            name = (const char *) dentry->d_name.name;
        }
        printk(KERN_DEBUG
               "%s(%d): dirtied inode %lu (%s) on %s\n",
               current->comm, task_pid_nr(current), inode->i_ino,
               name, inode->i_sb->s_id);
        if (dentry) {
            spin_unlock(&dentry->d_lock);
            dput(dentry);
        }
    }
}

III. Summary

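1. The kernel has plenty of convenient checkpoints for intercepting I/O information. Even without modifying the kernel, a jprobe can be used to capture a great deal.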

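2. debugfs is too slow when a large number of block -> file conversions are needed. Writing your own tool on top of ext2fs should be considerably more efficient.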
