Data Mining - Summary of Prediction Models

Posted by weixin_45727991 on 2020-11-08

A summary of various prediction models

II. Prediction models

First, a summary of the models I have figured out:

1. naiveBayes (Lecture 5)

library(e1071)

#the response y is email$spam; "~." means all the remaining attributes are used as predictors
#the second argument, laplace, is the Laplace smoothing parameter (0 = no smoothing)
#the third argument is the data source
NBfit<-naiveBayes(as.factor(email$spam)~.,laplace=0,data=email)

#use the fitted naiveBayes model to predict: the first argument is the fitted model object,
#the second is the predictor values to be scored
#store the predicted response values in the object pred2
pred2<-predict(NBfit,email[,2:19])

#compare the predicted values with the true values
#this gives a 2x2 matrix: the diagonal counts correct predictions, the off-diagonal cells are errors
table(pred2, email$spam)

2. Linear regression (the lm function) - Lecture 4

library(caret)

#createDataPartition builds a training index: y=faithful$waiting means the split is stratified on that attribute,
#p=0.5 puts p*100% of the rows in the training set, and list=FALSE returns the indices as a matrix instead of a list
inTrain<-createDataPartition(y=faithful$waiting, p=0.5, list=FALSE)
#50% of the rows are used for training
trainFaith<-faithful[inTrain,]
#the remaining 50% are used for testing
testFaith<-faithful[-inTrain,]

#the regression model function is here################################
#as above: formula y~x, then the data source
lm1<-lm(eruptions~waiting, data=trainFaith)

newdata<-data.frame(waiting=80)
predict(lm1, newdata)
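
The held-out half (testFaith) can be used to check the fit; a minimal sketch computing the test RMSE with the lm1 model above:

#predict eruption times for the held-out rows and compute the test RMSE
testPred<-predict(lm1, newdata=testFaith)
sqrt(mean((testPred-testFaith$eruptions)^2))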

Note: predict() on an lm object returns numeric fitted values; if an interval is requested (interval = "confidence" or "prediction", with level defaulting to 0.95 and adjustable), the result has three columns (fit, lwr, upr). Either way the output is continuous rather than a class label, so it cannot be passed to table() to build a confusion matrix.
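
For illustration, a small sketch of requesting a prediction interval from lm1 (the three columns mentioned above are fit, lwr and upr):

#point prediction plus a 95% prediction interval (columns fit, lwr, upr)
predict(lm1, newdata, interval="prediction", level=0.95)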

3. Decision tree (the rpart function)

library(rpart)
library(rpart.plot)

# grow the tree 
fit <- rpart(Kyphosis ~ Age + Number + Start,
             method="class", data=kyphosis)

printcp(fit) # display the results 
plotcp(fit) # visualize cross-validation results 
summary(fit) # detailed summary of splits

# plot tree 
rpart.plot(fit,extra=106, under=TRUE, faclen=0,
           cex=0.8, main="Decision Tree")

# prediction
result <- predict(fit,kyphosis[,-1],type="class") 

# confusion matrix
table(pred = result, true = kyphosis[,1])

Below, the left figure visualizes the cross-validation results and the right figure shows the fitted decision tree.

4. Pruned tree (still a decision tree fitted with rpart, just with the extra argument control = rpart.control(minsplit = 10), where minsplit is the minimum number of observations that must exist in a node for a split to be attempted. Strictly speaking this is a pre-pruning/stopping rule; a sketch of post-pruning with prune() follows the code below.)

# grow the tree with a minsplit stopping rule (pre-pruning)
# minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.
pfit1 <- rpart(Kyphosis ~ Age + Number + Start,
               method="class", data=kyphosis,
               control = rpart.control(minsplit = 10)) 

# plot the pruned tree 
rpart.plot(pfit1,extra=106, under=TRUE, faclen=0,
           cex=0.8, main="Decision Tree")
#prediction
result<-predict(pfit1,kyphosis[,-1],type="class")
# confusion matrix
table(pred = result, true = kyphosis[,1])
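
As noted above, minsplit only stops splits from being attempted. rpart also supports post-pruning an already grown tree with prune(), using the complexity parameter (cp) chosen from the cross-validation table. A minimal sketch, reusing the fit object from section 3:

# pick the cp value with the lowest cross-validated error and prune back to it
opt_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pfit2 <- prune(fit, cp = opt_cp)
rpart.plot(pfit2, extra=106, under=TRUE, faclen=0,
           cex=0.8, main="Post-pruned Decision Tree")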

5. Regression tree (also fitted with rpart; the only difference is method="anova") - Lecture 7

fit <- rpart(Mileage~Price + Country + Reliability + Type, 
             method="anova", data=cu.summary)

6. Random forest (grow n trees independently on bootstrap samples, then combine their predictions by an equal-weight vote) - Lecture 8

library(randomForest)

## Random Forest model
# mtry is the number of variables tried at each split; ntree=100 means 100 trees are grown and vote together
fit <- randomForest(y ~ .,   data=train, mtry=4, ntree=100)
print(fit) # view results 
importance(fit) # importance of each predictor
varImpPlot(fit)

# prediction results
RandomTreeresult<-predict(fit,test[,-17],type="class")
# confusion matrix
table(pred = RandomTreeresult, true = test$y)

7. AdaBoost (trees are fitted one after another: after each round the weights of misclassified observations are increased and those of correctly classified ones decreased, so each round is derived from the previous one) - Lecture 8
e.g.:

library(adabag)

adaboost<-boosting(y~., data=train, boos=FALSE, mfinal=20,coeflearn='Breiman')

summary(adaboost)
adaboost$trees
adaboost$weights
adaboost$importance
importanceplot(adaboost)

# error
errorChange<-errorevol(adaboost,train)
plot(errorChange$error,type='l')

# performance on the test data
adpred<-predict(adaboost,test)
table(pred=adpred$class,true=test$y)

# Tree visualizations: T1, T2, T19, T20 (only T1 and T2 shown here)
t1<-adaboost$trees[[1]]
t2<-adaboost$trees[[2]]

rpart.plot(t1,under=TRUE, faclen=0,
           cex=1, main="Decision Tree 1")
rpart.plot(t2,under=TRUE, faclen=0,
           cex=1, main="Decision Tree 2")

8. Logistic regression (Lecture 5)
A generalized linear model for binary classification; it is linear in the log-odds.

#glm is the usual function for fitting a logistic regression
#the first argument has the form y~x, with y the response and x the predictors;
#when x is ".", every attribute except y is used as a predictor
#the second argument is the data source; family="binomial" indicates a binary response
g_full=glm(spam~ ., data=email, family="binomial")

#now make predictions
#confusionMatrix takes the predicted vector as its first argument and the true vector as its second;
#it prints the table and reports accuracy and related statistics
#predict(..., type="response") returns fitted probabilities, so ifelse thresholds them at 0.5 to get 0/1 labels
pred<-ifelse(predict(g_full, email[,2:19], type="response")>0.5,1,0)
acc <- confusionMatrix(as.factor(pred), as.factor(email$spam))
acc

9. KNN (K Nearest Neighbor Classifiers) (Lecture 5)
Not practical when the number of dimensions is very high, because the computation becomes too expensive.
(Picture a circle drawn around each point that is used to classify the different types.)

library(class)    # provides knn()

# normalize (standardize the predictors)
# iris_random is assumed to be a row-shuffled copy of iris, e.g. iris_random <- iris[sample(nrow(iris)), ]
iris_new<-scale(iris_random[,-5],center = TRUE,scale = TRUE)

library(ggplot2)

# data visualization
# scatter plot of Sepal.Length against Petal.Width, colored by species
ggplot(iris_random, aes(Sepal.Length, Petal.Width)) +
  geom_point(aes(color = Species))

# construct the training and testing sets manually
train <- iris_new[1:100,]
test <- iris_new[101:150,]

train_sp <- iris_random[1:100,5]
test_sp <- iris_random[101:150,5]

# train/apply knn; cl gives the class labels of the training rows
model <- knn(train= train,test=test,cl= train_sp,k=8)
model

#distribution of the predicted classes
table(factor(model))

#true vs. predicted classes
table(test_sp,model)

#choose the right k####
#First Attempt to Determine Right K ####

#holds the accuracy for each k
iris_acc<-numeric() #Holding variable

#try k from 1 to 50 and see which value gives the highest accuracy
for(i in 1:50){
  #Apply knn with k = i
  pred_i<-knn(train= train,test=test,cl= train_sp,k=i)
  iris_acc<-c(iris_acc,
              mean(pred_i==iris_random[101:150,5]))
}

#Plot k= 1 through 50

#plot the error rate (1 - accuracy) against k
plot(1-iris_acc,type="l",ylab="Error Rate",
     xlab="K",main="Error Rate for Iris With Varying K")

# the plot suggests k = 12 is optimal for this split
model <- knn(train= train,test=test,cl= train_sp,k=12)
table(test_sp,model)

10. Perceptron (a linear classifier). The perceptron() function used below is not part of base R; it presumably comes from the course code, and a possible sketch of it is given after this section's code.

##iris DATA
#the iris data again; the procedure mirrors the earlier examples
data(iris)
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species",
      pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

# select the Sepal.Width versus Petal.Width scatter plot
x <- cbind(iris$Sepal.Width,iris$Petal.Width)
# label setosa as positive and the rest as negative
Y <- ifelse(iris$Species == "setosa", +1, -1)
# plot all the points
plot(x,cex=0.5)
# use plus sign for setosa points
points(subset(x,Y==1),col="black",pch="+",cex=2)
# use minus sign for the rest
points(subset(x,Y==-1),col="red",pch="-",cex=2)

p <- perceptron(x,Y)

plot(x,cex=0.2)
points(subset(x,Y==1),col="black",pch="+",cex=2)
points(subset(x,Y==-1),col="red",pch="-",cex=2)

# compute intercept on y axis of separator
# from w and b
intercept <- - p$b / p$w[[2]]
# compute slope of separator from w
slope <- - p$w[[1]] / p$w[[2]]
# draw separating boundary
abline(intercept,slope,col="red")
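
Since perceptron() is not a standard R function, here is a minimal, hypothetical sketch consistent with how it is used above (labels y in {-1, +1}, returning a list with weight vector w and bias b trained by the classic perceptron update rule):

# hypothetical perceptron trainer: sweep the data repeatedly and nudge (w, b)
# whenever a point falls on the wrong side of the current boundary
perceptron <- function(x, y, eta = 1, n_iter = 100) {
  w <- rep(0, ncol(x))
  b <- 0
  for (iter in 1:n_iter) {
    for (i in 1:nrow(x)) {
      if (y[i] * (sum(w * x[i, ]) + b) <= 0) {   # misclassified or on the boundary
        w <- w + eta * y[i] * x[i, ]
        b <- b + eta * y[i]
      }
    }
  }
  list(w = w, b = b)
}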

11. SVM (Support Vector Machines)

library(e1071)   # provides svm()

#subset 1: setosa and virginica (the two species that are easy to separate)
iris.part1 = subset(iris, Species != 'versicolor')
pairs(iris.part1[,1:4], col=iris.part1$Species)
iris.part1$Species = factor(iris.part1$Species)

#subset 2: versicolor and virginica (the two species that are harder to separate)
iris.part2 = subset(iris, Species != 'setosa')
pairs(iris.part2[,1:4], col=iris.part2$Species)
iris.part2$Species = factor(iris.part2$Species)

iris.part1 = iris.part1[, c(1,2,5)]
iris.part2 = iris.part2[, c(1,2,5)]

# linear kernel, linearly separable case
plot(iris.part1$Sepal.Length,iris.part1$Sepal.Width,col=iris.part1$Species)

fit1 = svm(Species ~ ., data=iris.part1, type='C-classification')
plot(fit1, iris.part1)
fit1.pred<-predict(fit1,iris.part1[,-3])
table(pred = fit1.pred, true = iris.part1[,3])


# linear kernel, partially non-separable case
fit2 = svm(Species ~ ., data=iris.part2, type='C-classification', kernel='linear')
plot(fit2, iris.part2)
fit2.pred<-predict(fit2,iris.part2[,-3])
table(pred = fit2.pred, true = iris.part2[,3])


# non-linear (radial) kernel for the non-separable case
plot(iris.part2$Sepal.Length,iris.part2$Sepal.Width,col=iris.part2$Species)

fit3 = svm(Species ~ ., data=iris.part2, type='C-classification', kernel='radial')
plot(fit3, iris.part2)
fit3.pred<-predict(fit3,iris.part2[,-3])
table(pred = fit3.pred, true = iris.part2[,3])


# multi-class classification
svm_model <- svm(Species ~ ., data=iris)
summary(svm_model)

x <- subset(iris, select=-Species)
y <- iris$Species

fit4.pred<-predict(svm_model,x)
table(pred = fit4.pred, true = y)

plot(svm_model, iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))

12. Lasso and ridge regression - Lecture 4 (not fully understood yet; marked to revisit)
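
No course code is reproduced for this part. As a placeholder, a minimal sketch using the glmnet package (the mtcars data is used purely for illustration; alpha = 1 gives the lasso penalty and alpha = 0 the ridge penalty):

library(glmnet)

# build a numeric predictor matrix and response (glmnet does not take formulas directly)
x_mat <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y_num <- mtcars$mpg

# cv.glmnet chooses the penalty strength lambda by cross-validation
cv_lasso <- cv.glmnet(x_mat, y_num, alpha = 1)
cv_ridge <- cv.glmnet(x_mat, y_num, alpha = 0)

coef(cv_lasso, s = "lambda.min")                 # coefficients at the best lambda
predict(cv_lasso, newx = x_mat[1:5, ], s = "lambda.min")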

Prediction approaches and methods

1. Forcing predictions into binary y values

##predict(..., type="response") returns fitted probabilities, while the outcome only takes
##the values 1/0, so ifelse is used as a threshold: y = 1 when the probability exceeds 0.5, otherwise 0
pred<-ifelse(predict(g_full, email[,2:19], type="response")>0.5,1,0) 

2. Confusion matrix
① table
Its two arguments are the vector of all predicted y values and the vector of all true y values.

table(pred2, email$spam)


② confusionMatrix (from the caret package): the arguments are the same as for table, except both vectors must first be converted to factors.

acc <- confusionMatrix(as.factor(pred2), as.factor(email$spam))
acc
