Principal Components and Factor Analysis

Posted by Valentin on 2017-01-11

1. PCA (principal component analysis) and EFA (exploratory factor analysis)

The common workflow is as follows (a minimal code sketch follows the list):

a. Preprocess the data. Missing values must be handled and the data standardized before running a principal component or factor analysis. Either a raw data matrix or a correlation matrix can be supplied as input; components and factors are then computed with the principal() and fa() functions, respectively.
b. Choose a factor model: judge whether PCA or EFA better suits the problem.
c. Decide how many components/factors to extract.
d. Rotate the components/factors.
e. Interpret the results.
f. Compute component or factor scores.
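As orientation, here is a minimal sketch of how these steps map onto psych functions. It assumes a hypothetical numeric data frame df; the worked examples below follow the same pattern with real datasets.

library(psych)                           # provides fa.parallel(), principal(), fa()
df <- na.omit(df)                        # (a) drop rows with missing values
fa.parallel(df, fa="both", n.iter=100)   # (c) suggest the number of components/factors
pc <- principal(df, nfactors=2, rotate="varimax", scores=TRUE)             # PCA route, steps (c)-(f)
ef <- fa(df, nfactors=2, rotate="oblimin", fm="ml", scores="regression")   # EFA route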

2 PCA
2.1 PCA

PCA amounts to condensation: the goal is to replace a large number of correlated variables with as few uncorrelated variables as possible, by finding the linear combinations that account for the greatest share of the variance in the original variables. The first principal component is defined below; each subsequent component explains as much of the remaining variance as possible while staying orthogonal to (uncorrelated with) all earlier components.

PC_1 = a_1*X_1 + a_2*X_2 + ... + a_k*X_k  (k <= n)

For example, the number of variables in the following dataset can be reduced while still faithfully reflecting the ratings given to the judges.
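A minimal sketch of where the weights a_i come from, using the USJudgeRatings data introduced in the next section: up to sign, they are the first eigenvector of the correlation matrix (principal() reports them rescaled by the square root of the eigenvalue, as variable-component correlations).

R <- cor(USJudgeRatings[,-1])            # correlation matrix of the 11 rating variables
e <- eigen(R)                            # eigen decomposition
a <- e$vectors[, 1]                      # weights a_i of the first component (sign is arbitrary)
pc1 <- scale(USJudgeRatings[,-1]) %*% a  # PC_1 = a_1*X_1 + ... + a_k*X_k on standardized data
e$values[1]                              # variance of PC_1: the first eigenvalue
head(pc1)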

2.2 Deciding how many components to retain

Use prior experience and theory; use a threshold on the cumulative proportion of variance explained; or examine the K*K correlation matrix among the variables. The approach used here is based on eigenvalues: each principal component is associated with an eigenvalue of the correlation matrix, and components are ordered by eigenvalue from largest to smallest as the first principal component, second principal component, and so on.

1. By the Kaiser-Harris criterion, retain components whose eigenvalues exceed 1; the number of eigenvalues greater than 1 is the number of components to keep.
2. Parallel analysis uses simulation: random data matrices of the same size as the original are generated, and a component is retained if its eigenvalue from the real data exceeds the mean of the corresponding eigenvalues from the random matrices.
3. A Cattell scree plot graphs the eigenvalues against the components; components above the sharpest bend (the elbow) in the curve are retained.

As the figure below shows, all three methods agree that this dataset calls for a single component:

library(psych)
fa.parallel(USJudgeRatings[,-1], fa="pc", n.iter=100, show.legend=FALSE,
            main="Scree plot with parallel analysis")

Parallel analysis suggests that the number of factors = NA and the number of components = 1

2.3 Extracting principal components

principal(r, nfactors=, rotate=, scores=)

r: a correlation matrix or a raw data matrix
nfactors: the number of components to extract
rotate: the rotation method
scores: whether to compute component scores

Example 1:

library(psych)
pc <- principal(USJudgeRatings[,-1], nfactors = 1)
pc

Principal Components Analysis

Call: principal(r = USJudgeRatings[, -1], nfactors = 1)

Standardized loadings (pattern matrix) based upon correlation matrix

PC1 h2 u2 com

INTG 0.92 0.84 0.1565 1

DMNR 0.91 0.83 0.1663 1

DILG 0.97 0.94 0.0613 1

CFMG 0.96 0.93 0.0720 1

DECI 0.96 0.92 0.0763 1

PREP 0.98 0.97 0.0299 1

FAMI 0.98 0.95 0.0469 1

ORAL 1.00 0.99 0.0091 1

WRIT 0.99 0.98 0.0196 1

PHYS 0.89 0.80 0.2013 1

RTEN 0.99 0.97 0.0275 1

PC1

SS loadings 10.13

Proportion Var 0.92

Mean item complexity = 1

Test of the hypothesis that 1 component is sufficient.

The root mean square of the residuals (RMSR) is 0.04

with the empirical chi square 6.21 with prob < 1

Fit based upon off diagonal values = 1
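In the table above, h2 and u2 are complements, and with a single component h2 is just the squared loading (these columns are explained below); a quick check, as a sketch using the pc object just created:

round(pc$loadings[, 1]^2, 2)   # reproduces the h2 column
round(1 - pc$communality, 4)   # reproduces the u2 column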

Reading the component loadings matrix:
PC1 is the first principal component.
h2 is the communality: the proportion of each variable's variance explained by the component(s).
u2 is the uniqueness: the proportion of variance the component(s) cannot explain.
The SS loadings row gives the eigenvalue associated with each component.
Proportion Var gives the share of the whole dataset's variance each component accounts for.

Example 2: decide the number of components for the Harman23.cor body-measurement data.

library(psych)
fa.parallel(Harman23.cor$cov, n.obs=302, fa="pc", n.iter=100, show.legend=FALSE,
            main="Scree plot with parallel analysis")

Parallel analysis suggests that the number of factors = NA and the number of components = 2


Extract the components:

library(psych)
PC <- principal(Harman23.cor$cov, nfactors = 2, rotate = "none")
PC

Principal Components Analysis

Call: principal(r = Harman23.cor$cov, nfactors = 2, rotate = "none")

Standardized loadings (pattern matrix) based upon correlation matrix

PC1 PC2 h2 u2 com

height 0.86 -0.37 0.88 0.123 1.4

arm.span 0.84 -0.44 0.90 0.097 1.5

forearm 0.81 -0.46 0.87 0.128 1.6

lower.leg 0.84 -0.40 0.86 0.139 1.4

weight 0.76 0.52 0.85 0.150 1.8

bitro.diameter 0.67 0.53 0.74 0.261 1.9

chest.girth 0.62 0.58 0.72 0.283 2.0

chest.width 0.67 0.42 0.62 0.375 1.7

PC1 PC2

SS loadings 4.67 1.77

Proportion Var 0.58 0.22

Cumulative Var 0.58 0.81

Proportion Explained 0.73 0.27

Cumulative Proportion 0.73 1.00

Mean item complexity = 1.7

Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is 0.05

Fit based upon off diagonal values = 0.99
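The Proportion Var rows follow directly from the SS loadings (the eigenvalues) divided by the number of variables, eight here; a quick check:

c(4.67, 1.77) / 8    # ~0.58 and ~0.22, the Proportion Var row
(4.67 + 1.77) / 8    # ~0.81, the Cumulative Var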

As the output shows, the first component correlates positively with every measure, while the second correlates positively with the last four variables and negatively with the first four. Since we want each component to be as interpretable as possible, with its own distinct meaning, we introduce component rotation.

2.4 Rotating principal components

Purpose: reduce noise so the component loadings matrix becomes easier to interpret.
Methods: orthogonal rotation (components stay uncorrelated) and oblique rotation (components may correlate).

Orthogonal rotation keeps the selected components uncorrelated; the most popular method is varimax. It cleans up the columns of the loadings matrix so that each component is explained by a limited set of variables: each column ends up with a few large loadings and many near-zero ones.

library(psych)
rc <- principal(Harman23.cor$cov, nfactors = 2, rotate = "varimax")
rc

Principal Components Analysis

Call: principal(r = Harman23.cor$cov, nfactors = 2, rotate = "varimax")

Standardized loadings (pattern matrix) based upon correlation matrix

RC1 RC2 h2 u2 com

height 0.90 0.25 0.88 0.123 1.2

arm.span 0.93 0.19 0.90 0.097 1.1

forearm 0.92 0.16 0.87 0.128 1.1

lower.leg 0.90 0.22 0.86 0.139 1.1

weight 0.26 0.88 0.85 0.150 1.2

bitro.diameter 0.19 0.84 0.74 0.261 1.1

chest.girth 0.11 0.84 0.72 0.283 1.0

chest.width 0.26 0.75 0.62 0.375 1.2

RC1 RC2

SS loadings 3.52 2.92

Proportion Var 0.44 0.37

Cumulative Var 0.44 0.81

Proportion Explained 0.55 0.45

Cumulative Proportion 0.55 1.00

Mean item complexity = 1.1

Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is 0.05

Fit based upon off diagonal values = 0.99
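As noted below, rotation redistributes the explained variance between components but preserves the total; a quick check against the two outputs:

4.67 + 1.77   # unrotated SS loadings sum to 6.44
3.52 + 2.92   # varimax SS loadings also sum to 6.44 (6.44/8 ~ 0.81, the unchanged 81%)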

RC indicates rotated components. Compared with the unrotated solution, the first component is now explained mainly by the first four variables (the length measurements) and the second by the last four. The two components remain uncorrelated, and the cumulative variance explained is unchanged (81%); only the split between the components has changed (component 1 drops from 58% to 44%, component 2 rises from 22% to 37%), so the two components now explain more similar shares of the variance. Strictly speaking, they are no longer principal components, since the defining property that each component in turn has maximal variance is lost. But our purpose is simply to replace a large set of variables with a smaller one, so this is fine.

2.5 Obtaining component scores

Use the component extracted from the eleven rating variables to score each judge:

pc <- principal(USJudgeRatings[,-1], nfactors = 1, scores = TRUE)
head(pc$scores)

PC1

AARONSON,L.H. -0.1857981

ALEXANDER,J.M. 0.7469865

ARMENTANO,A.J. 0.0704772

BERDON,R.I. 1.1358765

BRACKEN,J.J. -2.1586211

BURNS,E.B. 0.7669406

When scores = TRUE, the component scores are stored in the scores element of the object returned by principal().

cor(USJudgeRatings$CONT, pc$scores)

PC1

[1,] -0.008815895

cor(USJudgeRatings$INTG,pc$scores)

PC1

[1,] 0.9184219

Evidently, the ratings are unrelated to how much contact the lawyers had with the judges (CONT), but strongly related to the judges' rated integrity (INTG).

For Example 2, the scoring coefficients are:

round(unclass(rc$weights), 2)

RC1 RC2

height 0.28 -0.05

arm.span 0.30 -0.08

forearm 0.30 -0.09

lower.leg 0.28 -0.06

weight -0.06 0.33

bitro.diameter -0.08 0.32

chest.girth -0.10 0.34

chest.width -0.04 0.27
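These weights convert standardized measurements into component scores. A hypothetical sketch: Harman23.cor supplies only correlations, so the data frame newdata below (raw measurements with the same eight columns) is assumed, not provided.

z <- scale(newdata)   # standardize the hypothetical raw measurements
z %*% rc$weights      # one column of component scores per rotated component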

From these weights:

PC_1 = 0.28*height + 0.30*arm.span + 0.30*forearm + 0.28*lower.leg - 0.06*weight - 0.08*bitro.diameter - 0.10*chest.girth - 0.04*chest.width

PC_2 = -0.05*height - 0.08*arm.span - 0.09*forearm - 0.06*lower.leg + 0.33*weight + 0.32*bitro.diameter + 0.34*chest.girth + 0.27*chest.width

Both equations assume the body measurements have been standardized; in practice, the variables with small coefficients can be dropped from the calculation.

3 EFA
3.1 Introduction

EFA amounts to exploration: it looks for a set of hidden forces behind the observed variables, forces that usually cannot be observed directly, called factors. (Each factor is assumed to explain variance shared by several observed variables, and so is also called a common factor.) Each observed variable is modeled as

X_i = a_i1*F_1 + a_i2*F_2 + ... + a_ip*F_p + U_i  (p < k)

where the F_j are the common factors, U_i is the part of X_i unique to that variable, and the number of factors p is smaller than the number of observed variables k.

As an example, consider the correlations among six psychological tests contained in the ability.cov dataset (112 subjects):

general picture blocks maze reading vocab

general 1.00 0.47 0.55 0.34 0.58 0.51

picture 0.47 1.00 0.57 0.19 0.26 0.24

blocks 0.55 0.57 1.00 0.45 0.35 0.36

maze 0.34 0.19 0.45 1.00 0.18 0.22

reading 0.58 0.26 0.35 0.18 1.00 0.79

vocab 0.51 0.24 0.36 0.22 0.79 1.00

3.2 Deciding how many factors to extract

covariances <- ability.cov$cov
correlations <- cov2cor(covariances)
fa.parallel(correlations, n.obs=112, fa="both", n.iter=100,
            main="Scree plots with parallel analysis")

Parallel analysis suggests that the number of factors = 2 and the number of components = 1

By the PCA criteria, you might pick one component (per the scree test and parallel analysis) or two (eigenvalues greater than 1). When the criteria disagree like this, overestimating the number of factors usually gives better results than underestimating it, because extra factors tend to distort the "true structure" less.

Looking at the EFA results, two factors should clearly be extracted: the first two eigenvalues in the scree plot sit above the elbow and also exceed the mean eigenvalues of 100 simulated data matrices. Note in particular that for EFA the Kaiser-Harris criterion retains factors with eigenvalues above 0, not above 1, which again gives two.

3.3 Extracting common factors

fa(r, nfactors=, n.obs=, rotate=, scores=, fm=)

r: a correlation matrix or a raw data matrix
nfactors: the number of factors to extract
n.obs: the number of observations (required when a correlation matrix is supplied)
rotate: the rotation method (oblimin by default)
scores: whether to compute factor scores
fm: the factoring method; the default is minimum residual (minres), and alternatives include maximum likelihood (ml), principal axis (pa), and generalized least squares (gls)

The choice of fm depends on the situation. Statisticians tend to prefer maximum likelihood for its good statistical properties, but it sometimes fails to converge, in which case principal axis factoring works better.

fa <- fa(correlations, nfactors=2, rotate="none", fm="pa")
fa

Factor Analysis using method = pa

Call: fa(r = correlations, nfactors = 2, rotate = "none", fm = "pa")

Standardized loadings (pattern matrix) based upon correlation matrix

PA1 PA2 h2 u2 com

general 0.75 0.07 0.57 0.432 1.0

picture 0.52 0.32 0.38 0.623 1.7

blocks 0.75 0.52 0.83 0.166 1.8

maze 0.39 0.22 0.20 0.798 1.6

reading 0.81 -0.51 0.91 0.089 1.7

vocab 0.73 -0.39 0.69 0.313 1.5

PA1 PA2

SS loadings 2.75 0.83

Proportion Var 0.46 0.14

Cumulative Var 0.46 0.60

Proportion Explained 0.77 0.23

Cumulative Proportion 0.77 1.00

Mean item complexity = 1.5

Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 15 and the objective function was 2.5

The degrees of freedom for the model are 4 and the objective function was 0.07

The root mean square of the residuals (RMSR) is 0.03

The df corrected root mean square of the residuals is 0.06

Fit based upon off diagonal values = 0.99

Measures of factor score adequacy

PA1 PA2

Correlation of scores with factors 0.96 0.92

Multiple R square of scores with factors 0.93 0.84

Minimum correlation of possible factor scores 0.86 0.68
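The 60% figure discussed below comes straight from the SS loadings row:

(2.75 + 0.83) / 6    # ~0.60: share of the six tests' variance explained by two factors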

The two factors together explain 60% of the variance in the six psychological tests, but the factor loadings matrix is hard to interpret as it stands, so we turn to factor rotation.

3.4 Factor rotation

Orthogonal rotation:

fa.varimax <- fa(correlations, nfactors=2, rotate="varimax", fm="pa")
fa.varimax

Factor Analysis using method = pa

Call: fa(r = correlations, nfactors = 2, rotate = "varimax", fm = "pa")

Standardized loadings (pattern matrix) based upon correlation matrix

PA1 PA2 h2 u2 com

general 0.49 0.57 0.57 0.432 2.0

picture 0.16 0.59 0.38 0.623 1.1

blocks 0.18 0.89 0.83 0.166 1.1

maze 0.13 0.43 0.20 0.798 1.2

reading 0.93 0.20 0.91 0.089 1.1

vocab 0.80 0.23 0.69 0.313 1.2

PA1 PA2

SS loadings 1.83 1.75

Proportion Var 0.30 0.29

Cumulative Var 0.30 0.60

Proportion Explained 0.51 0.49

Cumulative Proportion 0.51 1.00

Mean item complexity = 1.3

Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 15 and the objective function was 2.5

The degrees of freedom for the model are 4 and the objective function was 0.07

The root mean square of the residuals (RMSR) is 0.03

The df corrected root mean square of the residuals is 0.06

Fit based upon off diagonal values = 0.99

Measures of factor score adequacy

PA1 PA2

Correlation of scores with factors 0.96 0.92

Multiple R square of scores with factors 0.91 0.85

Minimum correlation of possible factor scores 0.82 0.71

Now reading and vocab load heavily on the first factor; picture, blocks, and maze load heavily on the second; and the general intelligence test loads fairly evenly on both. This suggests a verbal intelligence factor and a nonverbal intelligence factor.

Orthogonal rotation artificially forces the two factors to be uncorrelated. To allow them to correlate, use an oblique rotation instead.

Oblique rotation:

fa.promax <- fa(correlations, nfactors=2, rotate="promax", fm="pa")

(The warning below flags a Heywood case: an inadmissible estimate such as a communality or loading at or above 1. Note reading's pattern loading of 1.01 in the output; such solutions should be interpreted with care.)

Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate =

rotate, : A Heywood case was detected. Examine the loadings carefully.

fa.promax

Factor Analysis using method = pa

Call: fa(r = correlations, nfactors = 2, rotate = "promax", fm = "pa")

Warning: A Heywood case was detected.

Standardized loadings (pattern matrix) based upon correlation matrix

PA1 PA2 h2 u2 com

general 0.36 0.49 0.57 0.432 1.8

picture -0.04 0.64 0.38 0.623 1.0

blocks -0.12 0.98 0.83 0.166 1.0

maze -0.01 0.45 0.20 0.798 1.0

reading 1.01 -0.11 0.91 0.089 1.0

vocab 0.84 -0.02 0.69 0.313 1.0

PA1 PA2

SS loadings 1.82 1.76

Proportion Var 0.30 0.29

Cumulative Var 0.30 0.60

Proportion Explained 0.51 0.49

Cumulative Proportion 0.51 1.00

With factor correlations of

PA1 PA2

PA1 1.00 0.57

PA2 0.57 1.00

Mean item complexity = 1.2

Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 15 and the objective function was 2.5

The degrees of freedom for the model are 4 and the objective function was 0.07

The root mean square of the residuals (RMSR) is 0.03

The df corrected root mean square of the residuals is 0.06

Fit based upon off diagonal values = 0.99

Measures of factor score adequacy

PA1 PA2

Correlation of scores with factors 0.97 0.94

Multiple R square of scores with factors 0.93 0.89

Minimum correlation of possible factor scores 0.86 0.77

Orthogonal rotation focuses on the factor structure matrix (the correlations between variables and factors, not shown above). Oblique rotation involves three matrices: the factor structure matrix; the factor pattern matrix (standardized regression coefficients, the weights for predicting the variables from the factors, shown above as the PA1 and PA2 columns); and the factor intercorrelation matrix (the factor correlations).

The factor structure matrix can be computed as F = P * Phi, where F is the factor structure matrix, P is the factor pattern matrix, and Phi is the factor intercorrelation matrix:

fsm <- function(oblique) {
  if (class(oblique)[2] == "fa" & is.null(oblique$Phi)) {
    warning("Object doesn't look like oblique EFA")
  } else {
    P <- unclass(oblique$loading)
    F <- P %*% oblique$Phi
    colnames(F) <- c("PA1", "PA2")
    return(F)
  }
}
fsm(fa.promax)

PA1 PA2

general 0.64 0.69

picture 0.33 0.61

blocks 0.44 0.91

maze 0.25 0.45

reading 0.95 0.47

vocab 0.83 0.46
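As a spot check of F = P * Phi: reading's structure loading on PA1 is 1.01 + (-0.11) * 0.57 = 0.947, which rounds to the 0.95 in the table above.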

The table above shows the correlations between the variables and the factors. Compare it with the factor loadings matrix from the varimax (orthogonal) rotation shown earlier.

The oblique structure matrix is clearly noisier, precisely because oblique rotation allows the latent factors to correlate. Although oblique methods are more complicated, the model is usually a better fit to real data.

Next, plot the orthogonal or oblique solution:

factor.plot(fa.promax, labels=rownames(fa.promax$loadings))

The plot shows that vocab and reading load heavily on the first factor, while blocks and maze load heavily on the second; the general intelligence test is fairly balanced between the two.

fa.diagram(fa.promax, simple=TRUE)

This produces the diagram of the two-factor oblique solution; together PA1 and PA2 explain 0.6 of the variance.

3.5 Factor scores

Compared with PCA, EFA places much less emphasis on scores. Factor scores are easy to obtain by adding scores=TRUE to the fa() call (when raw data are available); the function computes them with a regression method. The scoring coefficients (standardized regression weights) are also available, stored in the weights element:

fa.promax$weights

PA1 PA2

general 0.080 0.210

picture 0.021 0.090

blocks 0.044 0.695

maze 0.027 0.035

reading 0.739 0.044

vocab 0.176 0.039
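Given raw data, these weights yield the factor scores directly. A hypothetical sketch: ability.cov ships only a covariance matrix, so the raw-score data frame ability.raw below is assumed, not provided.

z <- scale(ability.raw)    # standardize the hypothetical raw test scores
z %*% fa.promax$weights    # regression-method factor scores
# psych::factor.scores() offers the same computation with several methods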

Example 2: Harman74.cor contains the correlations among 24 psychological tests.

fa.parallel(Harman74.cor$cov, n.obs=112, fa="both", n.iter=100,
            main="Scree plots with parallel analysis")

Parallel analysis suggests that the number of factors = 4 and the number of components = 3

fa.24testes <- fa(Harman74.cor$cov, nfactors = 4, rotate = "promax")
fsm2 <- function(oblique) {
  if (class(oblique)[2] == "fa" & is.null(oblique$Phi)) {
    warning("Object doesn't look like oblique EFA")
  } else {
    P <- unclass(oblique$loading)
    F <- P %*% oblique$Phi
    colnames(F) <- c("PA1", "PA2", "PA3", "PA4")
    return(F)
  }
}
fsm2(fa.24testes)

PA1 PA2 PA3 PA4

VisualPerception 0.38 0.74 0.35 0.41

Cubes 0.25 0.46 0.19 0.25

PaperFormBoard 0.29 0.56 0.12 0.28

Flags 0.38 0.58 0.24 0.30

GeneralInformation 0.80 0.49 0.42 0.43

PargraphComprehension 0.83 0.49 0.31 0.47

SentenceCompletion 0.84 0.48 0.36 0.36

WordClassification 0.68 0.57 0.43 0.41

WordMeaning 0.86 0.49 0.29 0.47

Addition 0.29 0.24 0.84 0.37

Code 0.36 0.40 0.62 0.54

CountingDots 0.20 0.42 0.73 0.32

StraightCurvedCapitals 0.38 0.61 0.62 0.37

WordRecognition 0.33 0.27 0.25 0.58

NumberRecognition 0.27 0.29 0.22 0.55

FigureRecognition 0.29 0.52 0.25 0.61

ObjectNumber 0.31 0.31 0.37 0.62

NumberFigure 0.25 0.48 0.46 0.58

FigureWord 0.30 0.39 0.30 0.47

Deduction 0.53 0.58 0.31 0.50

NumericalPuzzles 0.37 0.57 0.55 0.46

ProblemReasoning 0.52 0.57 0.32 0.50

SeriesCompletion 0.56 0.68 0.43 0.50

ArithmeticProblems 0.53 0.47 0.63 0.53

fa.diagram(fa.24testes,simple=FALSE)

fa.24testes$weights

MR1 MR3 MR2 MR4

VisualPerception -0.0024 0.2409 0.02493 0.0121

Cubes 0.0017 0.0848 0.00264 0.0035

PaperFormBoard 0.0029 0.1342 -0.02233 0.0068

Flags 0.0163 0.1201 0.00312 -0.0125

GeneralInformation 0.1768 0.0153 0.04359 -0.0040

PargraphComprehension 0.2101 0.0128 -0.02288 0.0679

SentenceCompletion 0.2471 0.0243 0.02754 -0.0695

WordClassification 0.0893 0.0755 0.04002 -0.0118

WordMeaning 0.2714 0.0084 -0.04253 0.0751

Addition 0.0247 -0.0906 0.45930 0.0472

Code 0.0085 0.0149 0.10626 0.1127

CountingDots -0.0268 0.0875 0.21379 -0.0198

StraightCurvedCapitals 0.0053 0.1402 0.12658 -0.0342

WordRecognition 0.0171 -0.0201 -0.00541 0.1841

NumberRecognition 0.0046 0.0016 -0.00558 0.1607

FigureRecognition -0.0126 0.0919 -0.01646 0.1935

ObjectNumber 0.0062 -0.0145 0.02571 0.2027

NumberFigure -0.0195 0.0633 0.05400 0.1459

FigureWord 0.0053 0.0339 0.01228 0.0897

Deduction 0.0405 0.0794 0.00013 0.0712

NumericalPuzzles 0.0035 0.0943 0.08262 0.0362

ProblemReasoning 0.0381 0.0780 0.00139 0.0708

SeriesCompletion 0.0411 0.1335 0.03489 0.0421

ArithmeticProblems 0.0457 0.0208 0.11061 0.0776

4. Summary

PCA lets us replace a large set of correlated variables with a smaller set of uncorrelated ones, simplifying the analysis; EFA is used to uncover the latent, unobservable structure behind a set of observed variables. PCA is most often used to reduce the dimensionality of data, while EFA is widely used for theory building in the social sciences.

Beyond EFA, R also offers other excellent latent-variable models, such as models for testing prior theory, models for mixed (numeric and categorical) data, and models for categorical multi-way tables.
