Paper Reading, Part 1: the mutual context model

Posted by 山城過雨 on 2017-04-15

Paper reading, part 1: Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses

A general method for reading papers

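  1. Read in multiple passes
    • The first pass gives you a rough picture
      Read the title first, then the abstract and introduction, then the conclusion, and finally look at the figures and at the inputs and outputs. You should be able to summarize in two or three sentences what the paper does.
    • The second pass is for the core idea
      The related work section shows how this paper differs from others; the authors will always stress the important parts repeatedly; focus on the paper's academic contributions.
    • The third pass is for the details
      The best way to understand the details is to reproduce the method proposed in the paper.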

2. Points to watch

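  • Printed copy
    Although a PDF is available, a printed copy helps you concentrate and makes note-taking easier. Key points, unfamiliar technical terms and the like are what you need to mark carefully when you first start reading papers.

  • Parameters
    If you need to reproduce the paper, pay special attention to the parameters, and check whether the authors' method has hidden parameters.

  • How it differs from related papers
    If a paper in the same area has already been published, why could this one still be published? Is what this paper adds on top of earlier work a methodological innovation, or a contribution of some other kind? This is something to watch for, summarize and learn from.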


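  3. Critical thinking

    • Do not blindly trust authority
      Research papers inevitably have holes, but authors will not write those holes into the paper themselves; what appears under limitations is at most a few minor, harmless issues. A paper is not unassailable just because its authors are especially good. Study it carefully: if you find a hole and improve the work on top of what the authors did, you can publish a paper too.

    • The basis for finding the key questions
      At the beginning you will have all sorts of questions, but they are not key questions, and they are not problems with the paper; they are your problems. You have not fully understood the paper yet: the answer is already in it, but you have not grasped it, or your background is not sufficient. Work through the problems you run into carefully and steadily, keep improving, and gradually you will be able to ask key, on-point questions.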

This paper

Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses

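This paper was published in IEEE TPAMI in 2012; the authors are the renowned Fei-Fei Li and her student Bangpeng Yao.
The original paper is linked under Reference at the end of this post. I had to present this paper in a seminar course, so I spent quite a while reading it and am reasonably familiar with it.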

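I read it using the multi-pass method described at the start. Because I have no real background in image recognition, it was fairly hard going, and I began by trying to translate the abstract and introduction.

I then concentrated on the model, and read the experiments last.

Time was tight, and I also had to prepare the slides and script for the presentation, so I only half understand some of the paper's details. Here I briefly describe the main idea of the paper and my own understanding of the mutual context model it proposes.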

Abstract

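Detecting objects in cluttered scenes and estimating articulated human body parts from 2D images are two challenging problems in computer vision. The difficulty is particularly pronounced in activities involving human-object interactions (e.g., playing tennis), where the relevant objects tend to be small or only partially visible and the human body parts are often self-occluded. (Occlusion means that there is something you want to see but cannot, due to some property of your sensor setup or some event.) We observe, however, that objects and human poses can serve as mutual context for each other: recognizing one facilitates the recognition of the other. In this paper, we propose a mutual context model to jointly model objects and human poses in human-object interaction activities. In our approach, object detection provides a strong prior for better human pose estimation, while human pose estimation improves the accuracy of detecting the objects that interact with the human. On a six-class sports data set and a data set of 24 classes of people interacting with musical instruments, we show that our mutual context model outperforms the state of the art in detecting very difficult objects and estimating human poses, as well as in classifying human-object interaction activities.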

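The above is the translation I wrote when I first started reading. It has many inaccuracies, but it gives a rough idea of the paper's main work. Two major tasks in image recognition, human pose estimation and object detection, suffer from difficulties such as self-occlusion that traditional methods do not handle well. This paper proposes a mutual context model that makes full use of the relationship between human pose and object, letting each serve as context for the other during recognition, so that both recognition tasks improve.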


Introduction

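Using context to aid visual recognition has recently received more and more attention. Psychology experiments show that context plays an important role in recognition in the human visual system. In computer vision, context has been used in problems such as object detection and recognition, scene recognition, action classification, and image segmentation. While the idea of using context is clearly a good one, observation shows that most contextual information contributes relatively little to boosting performance in recognition tasks. In the recent Pascal VOC (pattern analysis, statistical modelling and computational learning; visual object classes) challenge, the difference between context-based methods and sliding-window-based methods for object detection (e.g., detecting bicycles) is within a small margin of only 3-4%.

We argue that one reason for such a small margin is the lack of strong context. While it is easy to detect a car on a road, a strong car detector can detect cars with high accuracy whether or not they are on a road. Indeed, for the human visual system, detecting visual abnormality in a scene is crucial to survival and social activities (e.g., detecting a cat in a fridge or an unattended bag in an airport).

So is the importance of context overstated? Our answer is no. Many important visual recognition tasks rely heavily on context. One such scenario is human pose estimation and object detection in human-object interaction (HOI) activities. As shown in Figure 1, these two difficult tasks can benefit greatly from serving as context for each other. Without knowing that the person is making a defensive shot with a cricket bat, it is not easy to estimate the player's pose accurately; without seeing the player's pose, it is hard to detect the small ball in the player's hand, which is nearly invisible even to the human eye.

In this paper, we propose to model the mutual context between objects and human poses in HOI activities, so that each can facilitate the recognition of the other. Specifically, two types of contextual information are considered in our mutual context model. The co-occurrence context models the co-occurrence statistics between objects and specific types of human poses within each activity. The types of human poses, termed "atomic poses" (shown in Figure 6), can be thought of as a dictionary of human poses, where human poses with similar body-part layouts are represented by the same atomic pose. We also consider the spatial context, which models the spatial relationship between objects and different human body parts.

We show that our algorithm significantly improves the performance of both object detection and human pose estimation on a six-class sports data set and a data set of 24 classes of people interacting with musical instruments (PPMI). Furthermore, by putting object detection and human pose estimation together, our method also achieves higher accuracy in HOI activity classification.

Modeling the mutual context of objects and human poses has a theoretical basis in psychology. References 17 and 18 show that humans perceive human poses better when the object is shown. In reference 19, the authors further confirm the spatial and functional relationships between objects and human poses in HOI activities. In our work, we explicitly model these relationships so that the recognition of objects and human poses can benefit each other. This sets our method apart from most previous activity recognition approaches, such as references 20, 21, 22 and 23, where activity recognition is treated as a pure image or video classification problem, without a detailed analysis of the objects and human poses involved in the activities.

The rest of the paper is organized as follows. Section 2 describes related work. Sections 3, 4 and 5 elaborate on the details of our model, model learning, and inference, respectively. Section 6 gives the experimental results. Section 7 concludes the paper.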

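The introduction mainly lays out the overall organization of the paper and gives more detail on the mutual context model. After reading it, you should have a rough picture of the whole paper.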


The rest of this post walks through the paper in the order of my presentation.

background

interaction

You can think of many examples of interaction, because interaction is very common in daily life.

Given an image that contains a human-object interaction, we may want to know what the interaction activity is. In other words, we want to recognize the image: does it depict a man playing basketball, or a man playing a musical instrument? This is what we call "activity classification". Beyond classification, for a deeper understanding of the image, we also want some details. Classification tells us, say, that the image belongs to the tennis forehand class. We would also like to know the different parts of the body and where they are. This task is "human pose estimation". We also want to know the object that the person is interacting with. This task is "object detection".


Although human pose estimation and object detection have been studied widely in computer vision for many years, they are still very challenging problems, for the following reasons.

human pose estimation

In human pose estimation, an unusual part appearance, self-occlusion, or an image region that looks very similar to a body part can make the task challenging. As you can see in this image, a typical state-of-the-art pose analysis algorithm may give inaccurate or even wrong results.

But suppose we already know that the man is playing baseball and where the bat is. That should make human pose estimation easier, because it gives a strong prior for the pose. This is easy to explain: if the bat is here, then intuitively the pose is probably a batting pose and not some other pose.


object detection

The same holds for object detection, where similar problems can hinder us just as in human pose estimation. Sometimes the object in the image is very small, low-resolution or partially occluded, and some regions of the image background look very similar to the detection target.

A typical object detection approach is the scanning window. For a scenario like the one in this image (indicate it in the image), this approach may give many wrong results. But suppose we already know that this is the pose of throwing a ball. With the help of the pose, especially the location of the upper arm, we can locate the detection target more easily. That is, human pose estimation facilitates object detection.

Human pose estimation and object detection, two difficult tasks, can benefit greatly from serving as context for each other. That is the intuition behind using mutual context in this paper.


approach in this paper

This paper introduces a set of "atomic poses", which can be thought of as a dictionary of human poses: the same atomic pose describes all human poses that have the same layout of body parts, as the picture shows. With them, we learn an overall relationship between different activities, objects and human poses, rather than modeling the human-object interactions for each activity separately. The model can also deal with situations where a human interacts with any number of objects. By allowing human pose estimation and object detection to serve as context for each other, the recognition performance of both tasks is improved.

Two contexts are considered in the mutual context model: the co-occurrence context, which models the co-occurrence statistics between objects and specific types of human poses within each activity, and the spatial context, which models the spatial relationship between objects and different human body parts. In addition, the model incorporates a discriminative action classification component and state-of-the-art object and body part detectors, which further improve the recognition performance.

mutual context model

A graphical illustration of the model is shown in this picture (indicate it in the image). First, the activity, such as tennis forehand or volleyball smash, is represented by A. The object, such as a tennis racket or a volleyball, is represented by O. H indicates the atomic pose label that the human pose belongs to. Each overall human pose is decomposed into the spatial layout of a number of body parts, denoted by P, such as the head, the torso, the upper left arm and so on.

model representation

Putting everything together, the model can be written as a single formula.


This is a conditional random field (CRF) model; a CRF is a kind of statistical modeling method often applied in pattern recognition. As you can see from the formula, $\phi_1$ models the co-occurrence compatibility between $A$, $O$ and $H$; $\phi_2$ considers the spatial relationship between $O$ and $H$; and $\phi_3$ to $\phi_5$ model the image evidence based on state-of-the-art object detection, human pose estimation and activity classification approaches. That is the overview of the mutual context model, and I am now going to introduce its potentials.
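Going by the roles of the five potentials described here, the overall model score can be sketched as the sum below. The symbol $\Psi$, the image variable $I$ and the argument list of each potential are assumptions inferred from those descriptions rather than copied from the paper, so treat this as a reading aid only.

```latex
\Psi(A, O, H, I) = \phi_1(A, O, H) + \phi_2(H, O) + \phi_3(O, I) + \phi_4(H, I) + \phi_5(A, I)
```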


co-occurrence context

$\phi_1()$ models the compatibility between the activity, the objects and the human pose in terms of co-occurrence frequency. For instance, "tennis ball" and "tennis racket" always appear in the same activity, such as "tennis serve", and people usually serve in a few specific poses. In the formula, $N_h$ is the total number of atomic poses (the letter N stands for "the number of", so $N_o$ and $N_a$ are the numbers of object classes and activity classes), and $h_i$ represents the $i$-th atomic pose. The $\mathbb{1}$ is an indicator function, which returns 1 if $H$ equals $h_i$ and 0 otherwise. $\zeta_{i,j,k}$ represents the strength of the co-occurrence between $h_i$, $o_j$ and $a_k$; that is, the larger $\zeta_{i,j,k}$ is, the more likely $h_i$, $o_j$ and $a_k$ are to co-occur.
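To make the sum concrete, here is a minimal Python sketch of the co-occurrence potential read as $\phi_1(A,O,H)=\sum_i\sum_m\sum_j\sum_k \mathbb{1}(H=h_i)\,\mathbb{1}(O^m=o_j)\,\mathbb{1}(A=a_k)\,\zeta_{i,j,k}$. The sum over every object $O^m$ in the image, all function and variable names, and every $\zeta$ value are assumptions made for illustration; in the paper the weights are learned from data.

```python
from itertools import product

# Hypothetical label sets; a real model would use the data set's own classes.
POSES = ["pose_0", "pose_1"]
OBJECTS = ["tennis racket", "tennis ball", "volleyball"]
ACTIVITIES = ["tennis serve", "volleyball smash"]

# ZETA[i][j][k]: made-up co-occurrence strength of pose i, object j, activity k.
ZETA = [
    [[2.0, -1.0], [1.5, -1.0], [-1.0, 0.5]],    # pose_0
    [[-0.5, -1.0], [-0.5, -1.0], [-1.0, 2.0]],  # pose_1
]


def cooccurrence_potential(activity, objects_in_image, pose):
    """Sum zeta over all (pose, object, activity) triples whose indicators are all 1."""
    total = 0.0
    for obj in objects_in_image:  # one pass per object O^m in the image
        for (i, h), (j, o), (k, a) in product(
            enumerate(POSES), enumerate(OBJECTS), enumerate(ACTIVITIES)
        ):
            # Each indicator is 1 when the labels match and 0 otherwise.
            indicator = int(pose == h) * int(obj == o) * int(activity == a)
            total += indicator * ZETA[i][j][k]
    return total


serve_score = cooccurrence_potential(
    "tennis serve", ["tennis racket", "tennis ball"], "pose_0"
)
odd_score = cooccurrence_potential(
    "tennis serve", ["tennis racket", "tennis ball"], "pose_1"
)
print(serve_score, odd_score)
```

With these toy weights, the pose that tends to co-occur with a tennis serve scores higher than the one that does not, which is exactly the preference that $\zeta$ is meant to encode.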


spatial context

Next is $\phi_2()$, which considers the spatial relationship between the objects and the different parts of the human body. $M$ is the number of object bounding boxes in the image. "Bounding box" is a term from geometry: it is the box with the smallest measure (area, volume and so on) that encloses all the points of a set, so an object bounding box can be interpreted as the smallest box that encloses the object. For example, $M$ is 2 here. $\lambda_{i,j,l}$ encodes the set of weights for the relationship when the object class of $O^m$ is $o_j$, while $b(x_I^l, O^m)$ denotes the spatial relationship between $x_I^l$, the location of the center of the human's $l$-th body part, and $O^m$. $b(x_I^l, O^m)$ can be represented by a binary feature.
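Read together, these definitions suggest that the spatial potential has roughly the following form. The index $i$ over atomic poses, the upper limit $L$ (the number of body parts) and the order of the sums are assumptions inferred from the symbols described above, so check them against the paper before relying on them.

```latex
\phi_2(H, O) = \sum_{m=1}^{M} \sum_{i=1}^{N_h} \sum_{j=1}^{N_o} \sum_{l=1}^{L}
  \mathbb{1}(H = h_i)\, \mathbb{1}(O^m = o_j)\, \lambda_{i,j,l}^{\top}\, b(x_I^l, O^m)
```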
The picture is a visualization of the binary feature $b()$. The space is divided into 13 disjoint regions based on a coordinate frame defined by $x_I^l$. $b(x_I^l, O^m)$ is a sparse binary vector with only one 1, in the element that corresponds to the relative location of $O^m$ with respect to $x_I^l$. In the picture, $b(x_I^l, O^m)$ is a 13-dimensional binary vector with a 1 for the region filled in gray; in other words, the spatial relationship between the arm and the tennis racket is denoted by this 13-dimensional binary vector.
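As a rough illustration of such a one-hot spatial feature, the sketch below bins the object's position relative to a body-part center into 13 regions. Only the number 13 comes from the text: the particular layout used here (one inner disc plus three rings of four quadrants), the ring radii and all names are hypothetical, and the paper's figure defines the real partition.

```python
import math

N_REGIONS = 13  # from the text; the layout below is an assumed one


def spatial_feature(part_center, object_center, ring_edges=(20.0, 60.0, 120.0)):
    """Return a 13-dim one-hot list for the object's location relative to a body part."""
    dx = object_center[0] - part_center[0]
    dy = object_center[1] - part_center[1]
    dist = math.hypot(dx, dy)
    if dist < ring_edges[0]:
        region = 0  # object center is very close to the part center
    else:
        # Ring 0, 1 or 2; anything beyond the last edge falls in the outer ring.
        ring = 0 if dist < ring_edges[1] else (1 if dist < ring_edges[2] else 2)
        angle = math.atan2(dy, dx) % (2 * math.pi)
        sector = int(angle // (math.pi / 2)) % 4  # quadrant index 0..3
        region = 1 + ring * 4 + sector
    feature = [0] * N_REGIONS
    feature[region] = 1
    return feature


def spatial_term(weights, feature):
    """Dot product of a 13-dim weight vector (lambda) with the one-hot feature."""
    return sum(w * f for w, f in zip(weights, feature))


arm = (100.0, 100.0)
racket = (130.0, 100.0)
b = spatial_feature(arm, racket)
weights = [0.1 * r for r in range(N_REGIONS)]  # made-up lambda values
print(b, spatial_term(weights, b))
```

Because the feature is one-hot, the dot product simply picks out the single weight for the region the object falls in, which is why a sparse binary vector is a convenient encoding here.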


other

Maybe a little tedious, right? But that is how it is, and next we still have $\phi_3()$, which models the objects, $\phi_4()$, which models the human pose, and $\phi_5()$, which models the activities. Still, after the co-occurrence context and the spatial context, you may be wondering: where is the mutual context that we keep emphasizing?

Speaking loosely, and only loosely, mutual context means that the human pose can serve as context for object detection, and object detection can serve as context for human pose estimation. So the mutual context is parameterized by the model we have given. Activity, object, human pose and interaction are each indispensable to the mutual context. However, given our limited time tonight, we have to focus on some parts of the model and skip over others, so I am going to go through the next three parts of the mutual context model very quickly.

$\phi_3()$ models the objects in the image using the object detection scores in all the object bounding boxes and the spatial relationships between these boxes. $\phi_4()$ models the atomic pose that $H$ belongs to and the likelihood of observing image $I$ given the atomic pose $h_i$. The last one, $\phi_5()$, takes the global image as features and trains an activity classifier to model the HOI. It serves as the activity context for understanding the interactions between humans and objects.


conclusion for the mutual context

The mutual context model has the following properties, of which the third is the most notable. Compared with the original method, where the objects and human poses in each HOI activity are modeled separately, this model is easier to extend to larger-scale data sets and to other activities, because it jointly models all the objects and atomic poses in all the HOI activities. So the model is general, and that makes it valuable.


model learning

I am going to introduce model learning briefly.

As we said before, the atomic poses can be thought of as a dictionary of human poses, and they play a very important role in modeling human-object interactions. Human pose estimation becomes much easier if we know which atomic pose the image corresponds to.

In this paper, the atomic poses are obtained by clustering the configurations of human body parts, based on the annotations of the body parts. Compared with clusters obtained within each activity class separately, the atomic poses here are shared by all the activities, so the approach is easier to extend to more activity classes.
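The clustering idea can be sketched in a few lines of Python. This uses plain k-means as a stand-in, since the text does not say which clustering algorithm the paper uses; the three-part pose vectors, the normalized coordinates and the deterministic initialization are all assumptions made to keep the example small.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start from the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        labels = []
        for p in points:
            # Assign each pose to the nearest centroid (squared Euclidean distance).
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            label = dists.index(min(dists))
            labels.append(label)
            clusters[label].append(p)
        for idx, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[idx] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Each pose: flattened (x, y) of head, torso and right hand in normalized
# image coordinates. These annotations are made up for illustration.
poses = [
    [0.50, 0.10, 0.50, 0.50, 0.80, 0.00],  # arm raised
    [0.50, 0.10, 0.50, 0.50, 0.70, 0.90],  # arm lowered
    [0.50, 0.10, 0.50, 0.50, 0.82, 0.05],  # arm raised
    [0.50, 0.12, 0.50, 0.50, 0.72, 0.88],  # arm lowered
]

labels, atomic_poses = kmeans(poses, k=2)
print(labels)
```

Each resulting centroid plays the role of one atomic pose, and because the poses from every activity are pooled before clustering, the same atomic pose can be shared across activities.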

This picture shows all the atomic poses obtained from the sports data set.
Thicker lines indicate stronger connections. For instance, there is a thicker line between "tennis forehand" and the fourth atomic pose, which means there is a strong connection between them.

I do not intend to cover the training of the detectors and classifiers or the estimation of the model parameters. However, we should note that the mutual context model is a standard conditional random field with no hidden variables.


model inference

Now let me briefly go through model inference and then dive into the experiments.

We initialize the model inference with the spatial pyramid matching (SPM) action classification result, the object bounding boxes obtained from independent object detectors, and the initial pose estimation results from a pictorial structure model. In order to reduce false negatives in object detection, we keep all the bounding boxes whose object detector scores are larger than 0.9.

We use a pictorial structure with Gaussian distributions to update the pose estimation results, and a greedy forward search method to update the object detection results. Finally, we optimize the mutual context model by enumerating all possible combinations of A and H labels.
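Two of these steps are simple enough to sketch in Python: keeping the candidate boxes above a score threshold, and exhaustively enumerating the (A, H) label pairs. The function names, the box representation and the toy score table are hypothetical; in the real model the score would be the full model evaluated with the current object and body-part estimates held fixed.

```python
from itertools import product


def keep_candidate_boxes(boxes, threshold=0.9):
    """Keep every (label, score) box whose detector score exceeds the threshold."""
    return [box for box in boxes if box[1] > threshold]


def best_activity_and_pose(activities, atomic_poses, score):
    """Score every (A, H) pair exhaustively and return the highest-scoring one."""
    return max(product(activities, atomic_poses), key=lambda pair: score(*pair))


# Toy stand-in for the model score with objects and body parts held fixed.
TOY_SCORES = {
    ("tennis forehand", "pose_3"): 4.2,
    ("tennis forehand", "pose_7"): 1.0,
    ("volleyball smash", "pose_3"): 0.3,
    ("volleyball smash", "pose_7"): 2.5,
}

boxes = [("tennis racket", 1.4), ("tennis ball", 0.95), ("volleyball", 0.2)]
kept = keep_candidate_boxes(boxes)
best = best_activity_and_pose(
    ["tennis forehand", "volleyball smash"],
    ["pose_3", "pose_7"],
    lambda a, h: TOY_SCORES[(a, h)],
)
print(kept, best)
```

Exhaustive enumeration is affordable here because the numbers of activities and atomic poses are both small, so the number of (A, H) pairs stays manageable.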


experiment

So now let's go through the experiments.

This paper uses two known data sets of HOI activities, the sports data set and the PPMI data set. I will focus on the first one.
The sports data set contains six classes, and each class contains thirty training images and twenty testing images. Let's look at object detection first.
Here we evaluate the performance of detecting each object in all the testing images, using average precision as the measure. We can observe that our detection approach achieves the best performance among the compared approaches, whether the baseline (the deformable part model), the object context or the person context. For example, the cricket ball and the croquet ball are very similar and very difficult to detect with traditional methods: the deformable part model gives a performance of 24% and 50% on them, while our method yields 32% and 58%.
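For readers unfamiliar with the metric, the sketch below computes a simple, non-interpolated average precision from ranked detections. This is an assumption about the variant: the text does not say which definition the paper uses, and benchmark protocols often use an interpolated version. It also assumes that every ground-truth object appears among the scored detections.

```python
def average_precision(scores, is_positive):
    """Mean of the precision values measured at each correctly detected object."""
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(1 for _, positive in ranked if positive)
    if total_positives == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / total_positives


# Four detections sorted by confidence; the first and third are correct.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(round(ap, 4))
```

In this toy case the precision is 1/1 at the first hit and 2/3 at the second, so the average precision is their mean, 5/6.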

Next is human pose estimation. As with object detection, our method performs better than the state-of-the-art method, the pictorial structure method. "PS" in this table stands for "pictorial structure", and "class-based PS" means training one pictorial structure on the images of each class. Note that our method even shows a 3% average improvement over the class-based pictorial structure model, which uses the ground-truth activity class label.

Last is activity classification. Our model predicts the class label of the HOI activity, and we observe that our method outperforms the other approaches here as well.


conclusion

The major contribution of this paper is to demonstrate the importance of context in visual recognition. Specifically, we found that the mutual context between objects and human poses can significantly improve recognition.

The limitation of our work is that we need to annotate the human body parts and objects in each training image.

One direction for future work is to study weakly supervised or unsupervised approaches to understanding human-object interaction activities.


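Thanks to the instructor of this seminar, Ruizhen Hu, who gave us a great deal of advice on how to read papers. And of course, thanks to the authors of this paper!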


Reference

http://videolectures.net/cvpr2010_fei_fei_mmco/
http://ieeexplore.ieee.org/document/6165303/
presentation ppt: https://pan.baidu.com/s/1c1BfgMc password: 9ega
presentation script: https://pan.baidu.com/s/1o8v1WuA password: vbvy
