Linux File System Change Monitoring Technology, Notifier Technology

Posted by Andrew.Hann on 2015-05-13

catalog

1. Why monitor the file system
2. hotplug
3. udev
4. fanotify(fscking all notification system)
5. inotify
6. code example

 

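1. Why monitor the file system

In day-to-day work, people often need to know what has changed on certain files or directories, for example:

1. Being notified when a configuration file changes
2. Tracking changes to critical system files
3. Monitoring the overall usage of a disk partition
4. Cleaning up automatically after a system crash
5. Triggering a backup program automatically
6. Sending a notification when a file upload to a server completes
7. Anti-virus software needs to monitor file changes on disk in real time and scan file contents

The usual approach is polling, but polling only suits files that change frequently (each pass every x seconds is then likely to find new I/O). In every other case it is inefficient, and it can miss certain kinds of change altogether, for example a modification that leaves the file's modification time unchanged. Data integrity systems such as Tripwire track file changes on a time schedule, so they cannot help when changes have to be detected in real time.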
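
As a rough illustration of the polling approach described above, the sketch below compares successive stat() snapshots of a file. The helper name poll_changed is hypothetical and not part of any library; it only looks at size and modification time, so it shares the weakness mentioned above: a rewrite that leaves both values unchanged goes unnoticed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: compare the file's current size and mtime with the
   previous snapshot in *prev, then store the new snapshot in *prev.
   Returns 1 if something changed, 0 if not, -1 if stat() failed. */
int poll_changed(const char *path, struct stat *prev)
{
    struct stat now;

    if (stat(path, &now) == -1)
        return -1;

    int changed = now.st_size != prev->st_size ||
                  now.st_mtime != prev->st_mtime;
    *prev = now;
    return changed;
}
```

A caller would take an initial snapshot with stat() and then call poll_changed() every x seconds; every call costs a system call even when nothing has happened, which is the inefficiency that notifiers avoid.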

Relevant Link:

http://www.jiangmiao.org/blog/2179.html
http://www.infoq.com/cn/articles/inotify-linux-file-system-event-monitoring

 

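2. hotplug

Hotplug is a mechanism by which the kernel notifies user-space applications about events on hot-pluggable devices; desktop systems use it to manage devices effectively. Whenever a device is added to or removed from the system, a "hotplug event" is generated, which means the kernel calls the user-space program /sbin/hotplug. This program is typically a very small bash script that simply passes execution on to a series of other programs in the /etc/hotplug.d/ directory tree. On most Linux distributions the script looks like this: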

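DIR="/etc/hotplug.d"
# $1 is the name of the event; its own handlers run first, then the default ones
for I in "${DIR}/$1/"*.hotplug "${DIR}/"default/*.hotplug ; do
 if [ -f "$I" ]; then
 # run the handler only if it is executable, passing the event name along
 test -x "$I" && "$I" "$1" ;
 fi
done
exit 1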

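This script searches for all programs with a .hotplug suffix that may be interested in the event and invokes them, passing along a number of environment variables that the kernel has set.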
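
To make the environment-variable hand-off more concrete, here is a minimal sketch of what a handler program might do with those variables. The article does not list the variable names; ACTION, DEVPATH and SUBSYSTEM are the ones the kernel hotplug interface is known to set, and the helper name describe_hotplug_event is purely hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: format a one-line summary of the hotplug event
   described by the environment. Unset variables are shown as "?".
   Returns what snprintf() returns. */
int describe_hotplug_event(char *out, size_t size)
{
    const char *action = getenv("ACTION");       /* e.g. "add" or "remove" */
    const char *devpath = getenv("DEVPATH");     /* sysfs path of the device */
    const char *subsystem = getenv("SUBSYSTEM"); /* e.g. "usb" */

    return snprintf(out, size, "%s %s (%s)",
                    action ? action : "?",
                    devpath ? devpath : "?",
                    subsystem ? subsystem : "?");
}
```

A real handler would typically branch on ACTION and then load a driver, create a node, or log the event.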

Relevant Link:

http://linux-hotplug.sourceforge.net/?selected=overview
http://oss.org.cn/kernel-book/ldd3/ch14s07.html

 

3. udev

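udev is the device manager for the Linux kernel 2.6 series. Its main job is to manage the device nodes under /dev. It also takes over the roles of devfs and hotplug, which means it handles /dev and all user-space actions when hardware is added or removed, including firmware loading.
On a traditional Linux system the device nodes under /dev are a set of static files, whereas udev dynamically provides nodes only for the devices actually present in the system. devfs offered similar functionality, but udev has the following advantages:

1. udev supports persistent device naming that does not depend on the order in which devices are plugged in. The default udev setup provides persistent names for storage devices. A device can be identified by:
    1) its vid (vendor)
    2) its pid (device)
    3) attributes such as its device name (model)
    4) or the corresponding attributes of its parent device
2. udev runs entirely in user space, rather than in kernel space as devfs did. As a result, udev moves naming policy out of the kernel and can run arbitrary programs to derive a device's name from its attributes before the node is created.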

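0x1: How it runs

udev is a generic kernel device manager. It runs as a daemon on a Linux system and listens for the uevents that the kernel sends (over a netlink socket) when a new device is initialized or a device is removed from the system.
The system provides a set of rules that match against discoverable device events and exported attribute values. A matching rule may name and create the device node and run configuration programs to set the device up. udev rules can match on properties such as the kernel subsystem, the kernel device name, the physical location of the device, or its serial number. Rules can also ask an external program for information to name a device, or specify a custom name that always stays the same regardless of when the system discovers the device.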
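
The netlink channel mentioned above can be sketched in a few lines of C. This is not udev's own code, only an illustration under stated assumptions: the socket is subscribed to multicast group 1 (kernel-originated uevents), and each message is treated as a sequence of NUL-terminated strings of the form "action@devpath", "KEY=value", and so on. The helper names open_uevent_socket and uevent_get are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/* Open a netlink socket subscribed to kernel uevents.
   Returns the socket, or -1 on error. */
int open_uevent_socket(void)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_DGRAM | SOCK_CLOEXEC,
                    NETLINK_KOBJECT_UEVENT);
    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = 1;               /* group 1: uevents sent by the kernel */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: look up "KEY=value" in a received message and return
   a pointer to value, or NULL. The caller must make sure the buffer ends
   with a NUL byte so that the returned string is terminated. */
const char *uevent_get(const char *msg, size_t len, const char *key)
{
    size_t klen = strlen(key);
    const char *p = msg, *end = msg + len;

    while (p < end) {
        size_t n = strnlen(p, (size_t)(end - p));
        if (n > klen && strncmp(p, key, klen) == 0 && p[klen] == '=')
            return p + klen + 1;
        p += n + 1;                   /* skip the string and its NUL */
    }
    return NULL;
}
```

A listener would recv() into a buffer from the socket and call uevent_get(buf, len, "ACTION") or similar on each message; opening the socket may require privileges depending on the system configuration.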

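0x2: System architecture

The udev system can be divided into three parts:

1. The libudev library: used to obtain device information
2. The udevd daemon: runs in user space and manages the virtual /dev
3. The udevadm management command: used to diagnose problems

The system receives the messages that the kernel sends over the netlink socket.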

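0x3: Rule format

1. BUS: the bus; KERNEL: the kernel name, e.g. sd*; ID: the device ID, e.g. the ID on the bus; PLACE
2. SYSFS{filename} or ATTR{filename}
3. PROGRAM: call an external program; RESULT: match the result returned by PROGRAM; NAME
4. SYMLINK: symlink rule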

Relevant Link:

http://zh.wikipedia.org/wiki/Udev
https://www.ibm.com/developerworks/cn/linux/l-cn-udev/
https://wiki.archlinux.org/index.php/Udev_(%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87)
https://www.suse.com/zh-cn/documentation/sles11/singlehtml/book_sle_admin/cha.udev.html

 

4. fanotify(fscking all notification system)

Fanotify (fscking all notification and file access system) is a notifier, that is, a mechanism that generates notifications about file system changes. It is a next-generation file system notification mechanism intended to replace inotify.

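0x1: fanotify features: file system event notification

The most basic job of a notifier is to tell the monitoring program when the file system changes. In Linux history this service was first provided by dnotify, which inotify later replaced; fanotify provides notification as well. Its events are: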

1. FAN_ACCESS: File was accessed
2. FAN_MODIFY: File was modified
3. FAN_CLOSE_WRITE: Writable file closed
4. FAN_CLOSE_NOWRITE: Unwritable file closed
5. FAN_OPEN: File was opened
6. FAN_OPEN_PERM: File open in perm check
7. FAN_ACCESS_PERM: File accessed in perm check

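0x2: fanotify features: whole-file-system monitoring

Inotify uses a data structure called the watch descriptor to represent a monitored file or directory. Every file system object (file or directory) that needs to be monitored requires its own wd.

Fanotify has three basic modes:

1. directed: like inotify, directed mode works directly on the inode of the monitored object, and each mark covers a single object. Monitoring a large number of targets is therefore just as cumbersome.
2. per-mount: per-mount mode works on a mount point. For example, if disk /dev/sda2 is mounted at /home, every file system change under /home can be monitored. This can be seen as another kind of global mode.
3. global: global mode monitors the entire file system, and every change is reported to the listener. Anti-virus software works in this mode.
/*
Note that
fanotify still does not support sub-tree monitoring. It does go one step further than inotify in that it can monitor the direct children of a directory. For example, when /home and its direct children are monitored, files such as /home/foo1 and /home/foo2 are covered, but /home/pics/foo1 is not, because it is not a direct child of /home.
*/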
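
The directed and per-mount modes map onto the fanotify system calls roughly as follows. This is a sketch under assumptions, not the only way to set things up: watch_path and describe_mask are hypothetical helper names, the event choice (FAN_OPEN | FAN_CLOSE_WRITE) is arbitrary, and fanotify_init() normally requires CAP_SYS_ADMIN, so the call fails with EPERM for unprivileged users.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/fanotify.h>
#include <unistd.h>

/* Create a notification-only group and mark either one object plus its
   direct children (whole_mount == 0) or the whole mount containing path.
   Returns the fanotify fd, or -1 with errno set. */
int watch_path(const char *path, int whole_mount)
{
    unsigned int flags = FAN_MARK_ADD;
    unsigned long long mask = FAN_OPEN | FAN_CLOSE_WRITE;
    int fan = fanotify_init(FAN_CLASS_NOTIF | FAN_CLOEXEC, O_RDONLY);

    if (fan == -1)
        return -1;
    if (whole_mount)
        flags |= FAN_MARK_MOUNT;       /* per-mount mode */
    else
        mask |= FAN_EVENT_ON_CHILD;    /* directory plus direct children */
    if (fanotify_mark(fan, flags, mask, AT_FDCWD, path) == -1) {
        int saved = errno;
        close(fan);
        errno = saved;
        return -1;
    }
    return fan;
}

/* Hypothetical helper: render an event mask as "NAME|NAME".
   Returns the number of characters written. */
size_t describe_mask(unsigned long long mask, char *out, size_t size)
{
    static const struct { unsigned long long bit; const char *name; } names[] = {
        { FAN_ACCESS, "FAN_ACCESS" }, { FAN_MODIFY, "FAN_MODIFY" },
        { FAN_CLOSE_WRITE, "FAN_CLOSE_WRITE" },
        { FAN_CLOSE_NOWRITE, "FAN_CLOSE_NOWRITE" },
        { FAN_OPEN, "FAN_OPEN" }, { FAN_OPEN_PERM, "FAN_OPEN_PERM" },
        { FAN_ACCESS_PERM, "FAN_ACCESS_PERM" },
    };
    size_t used = 0;

    if (size > 0)
        out[0] = '\0';
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (!(mask & names[i].bit))
            continue;
        int n = snprintf(out + used, size - used, "%s%s",
                         used ? "|" : "", names[i].name);
        if (n < 0 || (size_t)n >= size - used)
            break;                     /* output buffer is full */
        used += (size_t)n;
    }
    return used;
}
```

Events are then read from the returned descriptor as struct fanotify_event_metadata records, and describe_mask() can be applied to each record's mask field for logging.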

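0x3: fanotify features: access control (access decision)

Access decision means that when a file is accessed, the monitoring program not only receives the event notification but can also decide whether to allow the operation. Anti-virus software needs this: when you try to open a file that contains a virus, fanotify sends a notification to the anti-virus software acting as listener, which must then both determine whether the file about to be opened is infected and block the unsafe operation.

When an app opens a file that is monitored by the AV program, the open operation triggers a fanotify notification. Before the VFS lets open return, fanotify asks the AV program: if it allows the access, the app's open call succeeds; otherwise the app's open call fails. This keeps applications from opening infected files.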
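
The "ask the AV program" step works by the listener writing a struct fanotify_response back to the fanotify descriptor. The sketch below shows only that reply; the helper name send_verdict is hypothetical, and it assumes the group was created with FAN_CLASS_CONTENT or FAN_CLASS_PRE_CONTENT and marked with FAN_OPEN_PERM, since only such groups receive permission events.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/fanotify.h>
#include <unistd.h>

/* Answer one permission event. fan_fd is the fanotify descriptor and
   event_fd is the fd delivered in the event's metadata.
   Returns 0 on success, -1 on a failed or short write. */
int send_verdict(int fan_fd, int event_fd, int allow)
{
    struct fanotify_response resp;

    resp.fd = event_fd;
    resp.response = allow ? FAN_ALLOW : FAN_DENY;
    if (write(fan_fd, &resp, sizeof resp) != (ssize_t)sizeof resp)
        return -1;
    return 0;
}
```

After replying, the listener should close event_fd; the blocked open() in the application then either proceeds or fails according to the verdict.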

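0x4: fanotify features: listener groups

Fanotify allows several listeners to monitor the same file system object at once. For example, anti-virus software V and desktop search software S may both monitor the directory /myDocument. When the file /myDocument/test is opened, fanotify notifies both V and S, in the order determined by the listener group policy.
Consider, for example, a class of software called a hierarchical storage manager (HSM). The file system may hold only a stub file while the real content lives on a lower storage tier. When the stub is opened, fanotify should notify the HSM first so that it can bring the real content into the stub file, and only then notify the anti-virus software so that it scans the real content. Otherwise the anti-virus software might scan only the stub, and the HSM would then bring in the virus.
Fanotify divides all listeners into three groups, in decreasing order of priority:

1. FAN_CLASS_PRE_CONTENT: 
Listeners initialized with FAN_CLASS_PRE_CONTENT have the highest priority and are notified first. FAN_CLASS_PRE_CONTENT is meant for applications such as an HSM that need access to a file before other applications use its content.

2. FAN_CLASS_CONTENT: 
Next comes FAN_CLASS_CONTENT, which suits software such as anti-virus scanners that need to inspect file content.

3. FAN_CLASS_NOTIF: 
FAN_CLASS_NOTIF listeners are notified last. FAN_CLASS_NOTIF is meant for pure notification software that does not need to access file content.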

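0x5: fanotify features: listener PID

If a program that monitors with inotify operates on a monitored file itself, that too triggers a notification. This sometimes causes problems, such as an endless chain of self-inflicted events:

inotify_add_watch(fd, "/home/lm/loop", IN_MODIFY | IN_OPEN | IN_CREATE | IN_DELETE); 
// watch the file /home/lm/loop 

for (;;) 
{ 
    readInotifyEvent(); 
    if (event->mask & IN_OPEN) 
        check_what_changed(event); // check what has changed
}

void check_what_changed(event) 
 { 
    fd = open(event->name, O_RDWR); // triggers another inotify notification
    read(fd, buf, 128);
    …
 }
// To check whether the file content has changed, check_what_changed() has to open the file, and that open itself triggers an inotify notification, so the code ends up in an infinite loop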

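Fanotify includes the PID of the process that triggered the event in each notification, so the problem above is easy to solve:

1. In check_what_changed, check the pid that caused the notification. If it is the monitoring program itself, ignore the notification and do not open the file again, which breaks the infinite loop.
2. In fact, a fanotify notification also carries an open fd for the monitored file system object. The application can operate on the object through this fd directly without causing new notifications, so the recursive events never arise. This is another improvement of fanotify over inotify.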
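
The PID check in point 1 could look like the following sketch. The helper names is_own_event and count_foreign_events are hypothetical; the code relies only on the pid, fd and event_len fields of struct fanotify_event_metadata and on the FAN_EVENT_OK / FAN_EVENT_NEXT macros from <sys/fanotify.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/fanotify.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if the event was triggered by the calling process itself. */
int is_own_event(const struct fanotify_event_metadata *ev)
{
    return ev->pid == getpid();
}

/* Hypothetical helper: walk a buffer filled by read() on a fanotify fd,
   count the events caused by other processes, and close every event fd
   so that descriptors do not leak. */
int count_foreign_events(const void *buf, ssize_t len)
{
    const struct fanotify_event_metadata *ev = buf;
    int n = 0;

    while (FAN_EVENT_OK(ev, len)) {
        if (!is_own_event(ev))
            n++;                       /* a real listener would handle it here */
        if (ev->fd >= 0)
            close(ev->fd);
        ev = FAN_EVENT_NEXT(ev, len);
    }
    return n;
}
```

In a real listener the branch that counts would instead inspect the file through ev->fd, which is the second technique described above.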

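0x6: fanotify features: decision cache

Anti-virus software has to scan every file that is about to be accessed, which hurts the user experience considerably. If a file is used frequently and is not modified, it is best to scan it only on the first access and skip the scan afterwards. This works like a cache: scanned files enter the cache, and when the same file is accessed again and is found in the cache, its content does not need to be scanned again.
Fanotify supports such a cache, known as ignore marks. It works simply: once an ignore mark is set on a file system object, later accesses to that object no longer trigger the access-control code, so access to the file is always allowed.
Anti-virus software can use the feature as follows: when an application opens file A for the first time, fanotify notifies the anti-virus software (AV) to scan the file content. If the AV software finds no virus, it allows the access and at the same time sets an ignore mark on the file.

When file A is accessed again, fanotify finds the corresponding ignore mark in the cache, so it no longer asks the AV software for an access decision and allows the request directly.

When the file content is modified, fanotify clears the ignore mark automatically. The number of marks is limited by default, but the limit can be lifted with an init flag (FAN_UNLIMITED_MARKS, passed to fanotify_init).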
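
Setting an ignore mark after a clean scan might look like the sketch below. It assumes the listener still holds the event's file descriptor: when the pathname argument of fanotify_mark() is NULL, the dirfd argument itself names the object to mark. The helper name cache_clean_verdict is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/fanotify.h>

/* After a clean scan, ask fanotify to stop generating FAN_OPEN_PERM events
   for the object behind event_fd until the object is modified.
   Returns 0 on success, -1 with errno set on failure. */
int cache_clean_verdict(int fan_fd, int event_fd)
{
    return fanotify_mark(fan_fd,
                         FAN_MARK_ADD | FAN_MARK_IGNORED_MASK,
                         FAN_OPEN_PERM,
                         event_fd, NULL);
}
```

The listener would call this right before answering the permission event with FAN_ALLOW, so that the next open of the same unmodified file skips the scan.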

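0x7: Drawbacks of fanotify

1. Fanotify currently supports far fewer file system event types than inotify
In particular, fanotify does not support move events, which makes it unusable for applications such as desktop search or real-time remote file system synchronization. When a file is moved from one directory to another, or renamed, fanotify generates no notification. As a result, some inotify-based applications cannot migrate to fanotify.

2. Like inotify, fanotify currently cannot do sub-tree monitoring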

Relevant Link:

https://www.ibm.com/developerworks/cn/linux/l-cn-fanotify/
http://www.lanedo.com/filesystem-monitoring-linux-kernel/
http://www.lanedo.com/users/amorgado/fanotify/fanotify-example-access-control.c
http://www.lanedo.com/users/amorgado/fanotify/fanotify-example-mount.c
http://www.lanedo.com/users/amorgado/fanotify/fanotify-example.c

 

5. inotify

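inotify is a Linux kernel subsystem that extends the file system so that it can monitor file system changes and report them to applications. It replaced dnotify, the older kernel module with similar functionality.
inotify is mainly used for:

1. Desktop search software such as Beagle, which can re-index only the files that changed instead of inefficiently scanning the whole file system every few minutes. Compared with actively polling the file system, having the operating system report file changes lets software such as Beagle update its index within a second of a change
2. Updating directory views
3. Reloading configuration files
4. Tracking changes
5. Backup
6. Synchronization, uploads, and many other automated workflows

inotify has many advantages over the older dnotify module it replaced. With dnotify, a program had to open a file descriptor for every monitored directory, which easily pushed the program's descriptor count toward the system limit and created a bottleneck. The file descriptors held by dnotify also kept the file system busy, so removable devices could not be unmounted.
dnotify's granularity is another weakness: it only lets programmers monitor changes at the directory level, so they have to write extra code themselves to track finer-grained file system events.
inotify uses fewer file descriptors and works with the select() and poll() interfaces, which is better than the signal mechanism dnotify uses. This also makes it easier to integrate inotify with existing libraries built on select() or poll(), such as GLib.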

0x1: inotify event types

1. IN_ACCESS: File was accessed (e.g., read(2), execve(2)).
2. IN_ATTRIB: Metadata changed—for example, 
    1) permissions (e.g.,chmod(2))
    2) timestamps (e.g., utimensat(2))
    3) extended attributes (setxattr(2))
    4) link count (since Linux 2.6.25; e.g., for the target of link(2) and for unlink(2))
    5) user/group ID (e.g., chown(2)).
3. IN_CLOSE_WRITE: File opened for writing was closed.
4. IN_CLOSE_NOWRITE: File or directory not opened for writing was closed.
5. IN_CREATE: File/directory created in watched directory, e.g. by
    1) open(2) O_CREAT
    2) mkdir(2)
    3) link(2)
    4) symlink(2)
    5) bind(2) on a UNIX domain socket 
6. IN_DELETE: File/directory deleted from watched directory.
7. IN_DELETE_SELF: 
Watched file/directory was itself deleted.  (This event also occurs if an object is moved to another filesystem, since mv(1) in effect copies the file to the other filesystem and then deletes it from the original filesystem.)  In addition, an IN_IGNORED event will subsequently be generated for the watch descriptor.
8. IN_MODIFY: File was modified (e.g., write(2), truncate(2)).
9. IN_MOVE_SELF: Watched file/directory was itself moved.
10. IN_MOVED_FROM: Generated for the directory containing the old filename when a file is renamed.
11. IN_MOVED_TO: Generated for the directory containing the new filename when a file is renamed.
12. IN_OPEN: File or directory was opened.
//IN_ALL_EVENTS: macro is defined as a bit mask of all of the above events.
13. IN_MOVE: Equates to IN_MOVED_FROM | IN_MOVED_TO.
14. IN_CLOSE: Equates to IN_CLOSE_WRITE | IN_CLOSE_NOWRITE.
15. IN_DONT_FOLLOW: Don't dereference pathname if it is a symbolic link.
16. IN_EXCL_UNLINK: events are not generated for children after they have been unlinked from the watched directory.
17. IN_MASK_ADD: If a watch instance already exists for the filesystem object corresponding to pathname, add (OR) the events in mask to the watch mask (instead of replacing the mask)
18. IN_ONESHOT: Monitor the filesystem object corresponding to pathname for one event, then remove from watch list.
19. IN_ONLYDIR: Only watch pathname if it is a directory
20. IN_IGNORED: Watch was removed explicitly (inotify_rm_watch(2)) or automatically (file was deleted, or filesystem was unmounted)
21. IN_ISDIR: Subject of this event is a directory.
22. IN_Q_OVERFLOW: Event queue overflowed
23. IN_UNMOUNT: Filesystem containing watched object was unmounted.  In addition, an IN_IGNORED event will subsequently be generated for the watch descriptor.

0x2: Examples

1. Suppose an application is watching the directory dir and the file dir/myfile for all events.  The examples below show some events that will be generated for these two objects.
fd = open("dir/myfile", O_RDWR);
    Generates IN_OPEN events for both dir and dir/myfile.
read(fd, buf, count);
    Generates IN_ACCESS events for both dir and dir/myfile.
write(fd, buf, count);
    Generates IN_MODIFY events for both dir and dir/myfile.
fchmod(fd, mode);
    Generates IN_ATTRIB events for both dir and dir/myfile.
close(fd);
    Generates IN_CLOSE_WRITE events for both dir and dir/myfile.


2. Suppose an application is watching the directories dir1 and dir2, and the file dir1/myfile.  The following examples show some events that may be generated.
link("dir1/myfile", "dir2/new");
    Generates an IN_ATTRIB event for myfile and an IN_CREATE event for dir2.
rename("dir1/myfile", "dir2/myfile");
    Generates an IN_MOVED_FROM event for dir1, an IN_MOVED_TO event for dir2, and an IN_MOVE_SELF event for myfile. The IN_MOVED_FROM and IN_MOVED_TO events will have the same cookie value.

3. Suppose that dir1/xx and dir2/yy are (the only) links to the same file, and an application is watching dir1, dir2, dir1/xx, and dir2/yy.  Executing the following calls in the order given below will generate the following events:
unlink("dir2/yy");
    Generates an IN_ATTRIB event for xx (because its link count changes) and an IN_DELETE event for dir2.
unlink("dir1/xx");
    Generates IN_ATTRIB, IN_DELETE_SELF, and IN_IGNORED events for xx, and an IN_DELETE event for dir1.

4. Suppose an application is watching the directory dir and (the empty) directory dir/subdir.  The following examples show some events that may be generated.
mkdir("dir/new", mode);
    Generates an IN_CREATE | IN_ISDIR event for dir.
rmdir("dir/subdir");
    Generates IN_DELETE_SELF and IN_IGNORED events for subdir, and an IN_DELETE | IN_ISDIR event for dir.
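
The rename example above notes that IN_MOVED_FROM and IN_MOVED_TO share a cookie value. A minimal sketch of pairing the two halves of a rename inside one read() buffer is shown here; find_move_target is a hypothetical helper name, and the code assumes both events arrived in the same buffer, which is not guaranteed in general.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/inotify.h>

/* Scan a buffer returned by read() on an inotify fd for the IN_MOVED_TO
   event whose cookie matches the given IN_MOVED_FROM event.
   Returns the matching event, or NULL if it is not in this buffer. */
const struct inotify_event *find_move_target(const char *buf, size_t len,
                                             const struct inotify_event *from)
{
    size_t off = 0;

    while (off + sizeof(struct inotify_event) <= len) {
        const struct inotify_event *ev =
            (const struct inotify_event *)(buf + off);

        if ((ev->mask & IN_MOVED_TO) && ev->cookie == from->cookie)
            return ev;
        /* each record is the fixed header followed by ev->len name bytes */
        off += sizeof(struct inotify_event) + ev->len;
    }
    return NULL;
}
```

When no partner is found, the file was moved out of (or into) the set of watched directories, and an application would usually treat the lone event as a delete or a create.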

0x3: Configuration interfaces (/proc interfaces)

The following interfaces can be used to limit the amount of kernel memory consumed by inotify:

1. /proc/sys/fs/inotify/max_queued_events
The value in this file is used when an application calls inotify_init(2) to set an upper limit on the number of events that can be queued to the corresponding inotify instance.
Events in excess of this limit are dropped, but an IN_Q_OVERFLOW event is always generated.

2. /proc/sys/fs/inotify/max_user_instances
This specifies an upper limit on the number of inotify instances that can be created per real user ID.

3. /proc/sys/fs/inotify/max_user_watches
This specifies an upper limit on the number of watches that can be created per real user ID.
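// Note in particular that inotify's coverage is bounded by these limits. Each watched directory consumes one watch, and the number of watches is capped because every watch takes up kernel memory. In principle, therefore, inotify cannot guarantee coverage of 100% of directories; that would require file system change monitoring implemented inside the kernel.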
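
A monitoring program may want to check these limits before adding a large number of watches. The sketch below reads one integer from such a file; read_limit is a hypothetical helper name, and the path to read is supplied by the caller (for instance /proc/sys/fs/inotify/max_user_watches).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a single integer from a file such as
   /proc/sys/fs/inotify/max_user_watches.
   Returns the value, or -1 if the file cannot be opened or parsed. */
long read_limit(const char *path)
{
    long value = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Comparing the result with the number of directories to be watched gives an early warning before inotify_add_watch() starts failing with ENOSPC.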

0x4: Limitations and caveats

1. The inotify API provides no information about the user or process that triggered the inotify event.  In particular, there is no easy way for a process that is monitoring events via inotify to distinguish events that it triggers itself from those that are triggered by other processes.

2. Inotify reports only events that a user-space program triggers through the filesystem API.  As a result, it does not catch remote events that occur on network filesystems.  (Applications must fall back to polling the filesystem to catch such events.)  Furthermore, various pseudo-filesystems such as /proc, /sys, and /dev/pts are not monitorable with inotify.

3. The inotify API does not report file accesses and modifications that may occur because of mmap(2), msync(2), and munmap(2).

4. The inotify API identifies affected files by filename.  However, by the time an application processes an inotify event, the filename may already have been deleted or renamed. Every host-based file change monitor runs into this problem; one possible mitigation is to block the delete operation until the event has been handled.

5. The inotify API identifies events via watch descriptors.  It is the application's responsibility to cache a mapping (if one is needed) between watch descriptors and pathnames.  Be aware that directory renamings may affect multiple cached pathnames.

6. Inotify monitoring of directories is not recursive: to monitor subdirectories under a directory, additional watches must be created. This can take a significant amount of time for large directory trees.

7. If monitoring an entire directory subtree, and a new subdirectory is created in that tree or an existing directory is renamed into that tree, be aware that by the time you create a watch for the new subdirectory, new files (and subdirectories) may already exist inside the subdirectory.  Therefore, you may want to scan the contents of the subdirectory immediately after adding the watch (and, if desired, recursively add watches for any subdirectories that it contains).

8. Note that the event queue can overflow.  In this case, events are lost.  Robust applications should handle the possibility of lost events gracefully.  For example, it may be necessary to rebuild part or all of the application cache.  (One simple, but possibly expensive, approach is to close the inotify file descriptor, empty the cache, create a new inotify file descriptor, and then re-create watches and cache entries for the objects to be monitored.)
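
Two of the caveats above (non-recursive watches and the need to map watch descriptors back to pathnames) can be handled together. The following is a minimal sketch, not a complete solution: it walks a tree once with nftw(), adds a watch per directory, and keeps a small fixed-size table. The names watch_tree, path_for_wd and MAX_WATCHES are hypothetical, the table size is arbitrary, and directories created after the walk would still have to be added when their IN_CREATE event arrives.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_WATCHES 256               /* arbitrary size for this sketch */

static int ino_fd = -1;               /* inotify fd used by the callback */
static struct { int wd; char path[256]; } watches[MAX_WATCHES];
static int n_watches;

/* nftw() callback: add a watch for each directory and remember its path */
static int add_dir(const char *path, const struct stat *sb, int type,
                   struct FTW *ftw)
{
    (void)sb;
    (void)ftw;
    if (type != FTW_D || n_watches == MAX_WATCHES)
        return 0;

    int wd = inotify_add_watch(ino_fd, path,
                               IN_CREATE | IN_DELETE | IN_MOVE | IN_ONLYDIR);
    if (wd == -1)
        return 0;                     /* skip directories we cannot watch */
    watches[n_watches].wd = wd;
    snprintf(watches[n_watches].path, sizeof watches[0].path, "%s", path);
    n_watches++;
    return 0;
}

/* Watch root and every directory below it.
   Returns the number of watches stored so far, or -1 if the walk failed. */
int watch_tree(int fd, const char *root)
{
    ino_fd = fd;
    if (nftw(root, add_dir, 16, FTW_PHYS) == -1)
        return -1;
    return n_watches;
}

/* Map the wd field of an event back to the watched directory's path */
const char *path_for_wd(int wd)
{
    for (int i = 0; i < n_watches; i++)
        if (watches[i].wd == wd)
            return watches[i].path;
    return NULL;
}
```

On an IN_Q_OVERFLOW event the simplest recovery is the one described in the last caveat: close the inotify descriptor, reset the table, and run the walk again.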

0x5: Kernel implementation

In the kernel, each inotify instance corresponds to an inotify_device structure
/source/fs/notify/inotify/inotify_user.c

struct inotify_device 
{
    /* 
    wait queue for i/o 
    wq is the wait queue; processes blocked in read() wait on it
    */
    wait_queue_head_t       wq;    
    
    struct mutex            ev_mutex;       /* protects event queue */
    struct mutex            up_mutex;       /* synchronizes watch updates */

    /* 
    list of queued events 
    events is the list of events that have occurred on this inotify instance; every event the instance watches for is added to this list when it occurs
    */
    struct list_head        events;         
    
    /* 
    user who opened this dev 
    user describes the user who created this inotify instance
    */
    struct user_struct      *user;          
    struct inotify_handle   *ih;            /* inotify handle */
    struct fasync_struct    *fa;            /* async notification */

    /* 
    reference count 
    count is the reference count
    */
    atomic_t                count;   
    
    /* 
    size of the queue (bytes) 
    queue_size is the size of this instance's event queue in bytes
    */
    unsigned int            queue_size; 
    
    /* 
    number of pending events 
    event_count is the number of events on the events list
    */
    unsigned int            event_count;    

    /* 
    maximum number of events 
    max_events is the maximum number of events allowed
    */
    unsigned int            max_events;     
};

Each watch corresponds to an inotify_watch structure
/source/linux/include/linux/inotify.h

struct inotify_watch 
{
    struct list_head        h_list; /* entry in inotify_handle's list */
    struct list_head        i_list; /* entry in inode's list */
    atomic_t                count;  /* reference count */
    struct inotify_handle   *ih;    /* associated inotify handle */
    struct inode            *inode; /* associated inode */
    __s32                   wd;     /* watch descriptor */
    __u32                   mask;   /* event mask for this watch */
};

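The inotify_device structure is created when user space calls inotify_init() and is released when the file descriptor returned by inotify_init() is closed.
Every directory or file is represented in the kernel by an inode structure, and the inotify system adds two fields to struct inode: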

struct inode 
{   
    ...
    #ifdef CONFIG_INOTIFY
        /* 
        watches on this inode 
        inotify_watches is the list of watches on the monitored target. Each time the user calls inotify_add_watch(), the kernel creates an inotify_watch structure for the new watch and inserts it into the inotify_watches list of the target's inode
        */
        struct list_head    inotify_watches;    
        
        /* 
        protects the watches list 
        inotify_mutex serializes access to the inotify_watches list
        */
        struct mutex        inotify_mutex;    
    #endif
    ...
};

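The key point about inotify's architecture is that file change monitoring needs support from both the kernel and user-space applications. The Linux kernel supports change notification natively at the file system layer: inotify notification calls are inserted inline into the code paths of all file system operations.
When one of the monitored events occurs, the corresponding file system code explicitly calls fsnotify_* to report the event to the inotify system, where * stands for the event name. The current implementation includes:

1. fsnotify_move: a file is moved from one directory to another
2. fsnotify_nameremove: a file is removed from a directory
3. fsnotify_inoderemove: the object itself is deleted
4. fsnotify_create: a new file is created
5. fsnotify_mkdir: a new directory is created
6. fsnotify_access: a file is read
7. fsnotify_modify: a file is written
8. fsnotify_open: a file is opened
9. fsnotify_close: a file is closed
10. fsnotify_xattr: a file's extended attributes are modified
11. fsnotify_change: a file or its metadata is modified
12. inotify_unmount_inodes: the exception; it is called when a file system is unmounted to report the umount event to the inotify system 

All of the functions above end up calling inotify_inode_queue_event (inotify_unmount_inodes calls inotify_dev_queue_event directly)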
/source/fs/notify/inotify/inotify.c

/**
 * inotify_inode_queue_event - queue an event to all watches on this inode
 * @inode: inode event is originating from
 * @mask: event mask describing this event
 * @cookie: cookie for synchronization, or zero
 * @name: filename, if any
 * @n_inode: inode associated with name
 */
void inotify_inode_queue_event(struct inode *inode, u32 mask, u32 cookie, const char *name, struct inode *n_inode)
{
    struct inotify_watch *watch, *next;

    //check whether this inode is watched at all, by testing whether its inotify_watches list is empty
    if (!inotify_inode_watched(inode))
        return;

    mutex_lock(&inode->inotify_mutex);
    //walk the inotify_watches list of this inode to see whether any watch is interested in the current file operation event
    list_for_each_entry_safe(watch, next, &inode->inotify_watches, i_list) 
    {
        u32 watch_mask = watch->mask;
        if (watch_mask & mask) 
        {
            struct inotify_handle *ih= watch->ih;
            mutex_lock(&ih->mutex);
            if (watch_mask & IN_ONESHOT)
                remove_watch_no_event(watch, ih);
            ih->in_ops->handle_event(watch, watch->wd, mask, cookie, name, n_inode);
            mutex_unlock(&ih->mutex);
        }
    }
    mutex_unlock(&inode->inotify_mutex);
}
EXPORT_SYMBOL_GPL(inotify_inode_queue_event);

inotify delivers event notifications through a chain of groups, and all watches are attached to a group
/source/include/linux/fsnotify_backend.h

/*
 * A group is a "thing" that wants to receive notification about filesystem
 * events.  The mask holds the subset of event types this group cares about.
 * refcnt on a group is up to the implementor and at any moment if it goes 0
 * everything will be cleaned up.
 */
struct fsnotify_group 
{
    /*
     * global list of all groups receiving events from fsnotify.
     * anchored by fsnotify_groups and protected by either fsnotify_grp_mutex
     * or fsnotify_grp_srcu depending on write vs read.
     */
    struct list_head group_list;

    /*
     * Defines all of the event types in which this group is interested.
     * This mask is a bitwise OR of the FS_* events from above.  Each time
     * this mask changes for a group (if it changes) the correct functions
     * must be called to update the global structures which indicate global
     * interest in event types.
     */
    __u32 mask;

    /*
     * How the refcnt is used is up to each group.  When the refcnt hits 0
     * fsnotify will clean up all of the resources associated with this group.
     * As an example, the dnotify group will always have a refcnt=1 and that
     * will never change.  Inotify, on the other hand, has a group per
     * inotify_init() and the refcnt will hit 0 only when that fd has been
     * closed.
     */
    atomic_t refcnt;        /* things with interest in this group */
    unsigned int group_num;        /* simply prevents accidental group collision */

    /* 
    how this group handles things 
    this is the member to focus on
    */
    const struct fsnotify_ops *ops;    

    /* needed to send notification to userspace */
    struct mutex notification_mutex;    /* protect the notification_list */
    struct list_head notification_list;    /* list of event_holder this group needs to send to userspace */
    wait_queue_head_t notification_waitq;    /* read() on the notification file blocks on this waitq */
    unsigned int q_len;            /* events on the queue */
    unsigned int max_events;        /* maximum events allowed on the list */

    /* stores all fastapth entries assoc with this group so they can be cleaned on unregister */
    spinlock_t mark_lock;        /* protect mark_entries list */
    atomic_t num_marks;        /* 1 for each mark entry and 1 for not being
                     * past the point of no return when freeing
                     * a group */
    struct list_head mark_entries;    /* all inode mark entries for this group */

    /* prevents double list_del of group_list.  protected by global fsnotify_grp_mutex */
    bool on_group_list;

    /* groups can define private fields here or use the void *private */
    union {
        void *private;
#ifdef CONFIG_INOTIFY_USER
        struct inotify_group_private_data {
            spinlock_t    idr_lock;
            struct idr      idr;
            u32             last_wd;
            struct fasync_struct    *fa;    /* async notification */
            struct user_struct      *user;
        } inotify_data;
#endif
    };
};

The member to focus on is const struct fsnotify_ops *ops;

/*
 * Each group much define these ops.  The fsnotify infrastructure will call
 * these operations for each relevant group.
 *
 * should_send_event - given a group, inode, and mask this function determines
 *        if the group is interested in this event.
 * handle_event - main call for a group to handle an fs event
 * free_group_priv - called when a group refcnt hits 0 to clean up the private union
 * freeing-mark - this means that a mark has been flagged to die when everything
 *        finishes using it.  The function is supplied with what must be a
 *        valid group and inode to use to clean up.
 */
struct fsnotify_ops 
{
    bool (*should_send_event)(struct fsnotify_group *group, struct inode *inode, __u32 mask);
    int (*handle_event)(struct fsnotify_group *group, struct fsnotify_event *event);
    void (*free_group_priv)(struct fsnotify_group *group);
    void (*freeing_mark)(struct fsnotify_mark_entry *entry, struct fsnotify_group *group);
    void (*free_event_priv)(struct fsnotify_event_private_data *priv);
};

0x6: How IN_CLOSE_WRITE event monitoring is implemented in the kernel

/source/fs/open.c

/*
 * Careful here! We test whether the file pointer is NULL before
 * releasing the fd. This ensures that one clone task can't release
 * an fd while another clone is opening it.
 */
SYSCALL_DEFINE1(close, unsigned int, fd)
{
    struct file * filp;
    struct files_struct *files = current->files;
    struct fdtable *fdt;
    int retval;

    spin_lock(&files->file_lock);
    /*
    get the pointer to the struct fdtable
    \linux-2.6.32.63\include\linux\fdtable.h
    #define files_fdtable(files) (rcu_dereference((files)->fdt))
    */
    fdt = files_fdtable(files);
    if (fd >= fdt->max_fds)
    {
        goto out_unlock;
    } 
    //look up the struct file that the descriptor being closed refers to
    filp = fdt->fd[fd];
    if (!filp)
    {
        goto out_unlock;
    } 
    /*
    set the corresponding entry of the fd array to NULL
    */
    rcu_assign_pointer(fdt->fd[fd], NULL);
    FD_CLR(fd, fdt->close_on_exec); 
    /*
    call __put_unused_fd to release the current fd, so that the next open of a new file can reuse it
    static void __put_unused_fd(struct files_struct *files, unsigned int fd)
    {
        struct fdtable *fdt = files_fdtable(files);
        __FD_CLR(fd, fdt->open_fds);
        if (fd < files->next_fd)
        {
            files->next_fd = fd;
        } 
    }
    */
    __put_unused_fd(files, fd);
    spin_unlock(&files->file_lock);
    retval = filp_close(filp, files);

    /* can't restart close syscall because file table entry was cleared */
    if (unlikely(retval == -ERESTARTSYS || retval == -ERESTARTNOINTR || retval == -ERESTARTNOHAND || retval == -ERESTART_RESTARTBLOCK))
    {
        retval = -EINTR;
    } 

    return retval;

out_unlock:
    spin_unlock(&files->file_lock);
    return -EBADF;
}
EXPORT_SYMBOL(sys_close);

retval = filp_close(filp, files);

/*
 * "id" is the POSIX thread ID. We use the
 * files pointer for this..
 */
int filp_close(struct file *filp, fl_owner_t id)
{
    int retval = 0;

    if (!file_count(filp)) 
    {
        printk(KERN_ERR "VFS: Close: file count is 0\n");
        return 0;
    }

    if (filp->f_op && filp->f_op->flush)
    {
        retval = filp->f_op->flush(filp, id);
    } 

    dnotify_flush(filp, id);
    locks_remove_posix(filp, id);
    fput(filp);
    return retval;
}
EXPORT_SYMBOL(filp_close);

fput(filp);
/source/fs/file_table.c

void fput(struct file *file)
{
    if (atomic_long_dec_and_test(&file->f_count))
        __fput(file);
}
EXPORT_SYMBOL(fput);

/* __fput is called from task context when aio completion releases the last
 * last use of a struct file *.  Do not use otherwise.
 */
void __fput(struct file *file)
{
    struct dentry *dentry = file->f_path.dentry;
    struct vfsmount *mnt = file->f_path.mnt;
    struct inode *inode = dentry->d_inode;

    might_sleep();

    //inotify notification point in the kernel
    fsnotify_close(file);
    /*
     * The function eventpoll_release() should be the first called
     * in the file cleanup chain.
     */
    eventpoll_release(file);
    locks_remove_flock(file);

    if (unlikely(file->f_flags & FASYNC)) {
        if (file->f_op && file->f_op->fasync)
            file->f_op->fasync(-1, file, 0);
    }
    if (file->f_op && file->f_op->release)
        file->f_op->release(inode, file);

    //LSM hook point
    security_file_free(file);

    ima_file_free(file);
    if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
        cdev_put(inode->i_cdev);
    fops_put(file->f_op);
    put_pid(file->f_owner.pid);
    file_kill(file);
    if (file->f_mode & FMODE_WRITE)
        drop_file_write_access(file);
    file->f_path.dentry = NULL;
    file->f_path.mnt = NULL;
    file_free(file);
    dput(dentry);
    mntput(mnt);
}

fsnotify_close(file);
\linux-2.6.32.63\include\linux\fsnotify.h

/*
 * fsnotify_close - file was closed
 */
static inline void fsnotify_close(struct file *file)
{
    struct dentry *dentry = file->f_path.dentry;
    struct inode *inode = dentry->d_inode;
    fmode_t mode = file->f_mode;
    //decide which close event to report, based on whether the file was open for writing
    __u32 mask = (mode & FMODE_WRITE) ? FS_CLOSE_WRITE : FS_CLOSE_NOWRITE;

    if (S_ISDIR(inode->i_mode))
        mask |= FS_IN_ISDIR;

    inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

    fsnotify_parent(dentry, mask);
    fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL, 0);
}

Relevant Link:

http://www.ibm.com/developerworks/cn/linux/l-inotifynew/

 

6. code example

       #include <errno.h>
       #include <poll.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/inotify.h>
       #include <unistd.h>

       /* Read all available inotify events from the file descriptor 'fd'.
          wd is the table of watch descriptors for the directories in argv.
          argc is the length of wd and argv.
          argv is the list of watched directories.
          Entry 0 of wd and argv is unused. */

       static void
       handle_events(int fd, int *wd, int argc, char* argv[])
       {
           /* Some systems cannot read integer variables if they are not
              properly aligned. On other systems, incorrect alignment may
              decrease performance. Hence, the buffer used for reading from
              the inotify file descriptor should have the same alignment as
              struct inotify_event. */

           char buf[4096]
               __attribute__ ((aligned(__alignof__(struct inotify_event))));
           const struct inotify_event *event;
           int i;
           ssize_t len;
           char *ptr;

           /* Loop while events can be read from inotify file descriptor. */

           for (;;) {

               /* Read some events. */

               len = read(fd, buf, sizeof buf);
               if (len == -1 && errno != EAGAIN) {
                   perror("read");
                   exit(EXIT_FAILURE);
               }

               /* If the nonblocking read() found no events to read, then
                  it returns -1 with errno set to EAGAIN. In that case,
                  we exit the loop. */

               if (len <= 0)
                   break;

               /* Loop over all events in the buffer */

               for (ptr = buf; ptr < buf + len;
                       ptr += sizeof(struct inotify_event) + event->len) {

                   event = (const struct inotify_event *) ptr;

                   /* Print event type */

                   if (event->mask & IN_OPEN)
                       printf("IN_OPEN: ");
                   if (event->mask & IN_CLOSE_NOWRITE)
                       printf("IN_CLOSE_NOWRITE: ");
                   if (event->mask & IN_CLOSE_WRITE)
                       printf("IN_CLOSE_WRITE: ");

                   /* Print the name of the watched directory */

                   for (i = 1; i < argc; ++i) {
                       if (wd[i] == event->wd) {
                           printf("%s/", argv[i]);
                           break;
                       }
                   }

                   /* Print the name of the file */

                   if (event->len)
                       printf("%s", event->name);

                   /* Print type of filesystem object */

                   if (event->mask & IN_ISDIR)
                       printf(" [directory]\n");
                   else
                       printf(" [file]\n");
               }
           }
       }

       int
       main(int argc, char* argv[])
       {
           char buf;
           int fd, i, poll_num;
           int *wd;
           nfds_t nfds;
           struct pollfd fds[2];

           if (argc < 2) {
               printf("Usage: %s PATH [PATH ...]\n", argv[0]);
               exit(EXIT_FAILURE);
           }

           printf("Press ENTER key to terminate.\n");

           /* Create the file descriptor for accessing the inotify API */

           fd = inotify_init1(IN_NONBLOCK);
           if (fd == -1) {
               perror("inotify_init1");
               exit(EXIT_FAILURE);
           }

           /* Allocate memory for watch descriptors */

           wd = calloc(argc, sizeof(int));
           if (wd == NULL) {
               perror("calloc");
               exit(EXIT_FAILURE);
           }

           /* Mark directories for events
              - file was opened
              - file was closed */

           for (i = 1; i < argc; i++) {
               wd[i] = inotify_add_watch(fd, argv[i],
                                         IN_OPEN | IN_CLOSE);
               if (wd[i] == -1) {
                   fprintf(stderr, "Cannot watch '%s'\n", argv[i]);
                   perror("inotify_add_watch");
                   exit(EXIT_FAILURE);
               }
           }

           /* Prepare for polling */

           nfds = 2;

           /* Console input */

           fds[0].fd = STDIN_FILENO;
           fds[0].events = POLLIN;

           /* Inotify input */

           fds[1].fd = fd;
           fds[1].events = POLLIN;

           /* Wait for events and/or terminal input */

           printf("Listening for events.\n");
           while (1) {
               poll_num = poll(fds, nfds, -1);
               if (poll_num == -1) {
                   if (errno == EINTR)
                       continue;
                   perror("poll");
                   exit(EXIT_FAILURE);
               }

               if (poll_num > 0) {

                   if (fds[0].revents & POLLIN) {

                       /* Console input is available. Empty stdin and quit */

                       while (read(STDIN_FILENO, &buf, 1) > 0 && buf != '\n')
                           continue;
                       break;
                   }

                   if (fds[1].revents & POLLIN) {

                       /* Inotify events are available */

                       handle_events(fd, wd, argc, argv);
                   }
               }
           }

           printf("Listening for events stopped.\n");

           /* Close inotify file descriptor */

           close(fd);

           free(wd);
           exit(EXIT_SUCCESS);
       }

Relevant Link:

http://linux.die.net/man/7/inotify
http://man7.org/linux/man-pages/man7/inotify.7.html
http://www.ibm.com/developerworks/cn/linux/l-inotifynew/

 

Copyright (c) 2015 LittleHann All rights reserved

 
