Learning R: Advanced Data Management

Posted by weixin_34110749 on 2019-01-23

Original article: https://blog.csdn.net/weixin_34110749/article/details/87789719

As before, let's use an example to introduce advanced data management operations.
First, create a data frame.

> Student<-c("John Davis","Angela Williams","Bullwinkle Moose","David Jones","Janice Markhammer","Cheryl Cushing","Reuven Ytzrhak","Greg Knox","Joel Enghland","Marry Rayburn")
> math<-c(502,600,412,358,495,512,410,625,573,522)
> Science<-c(95,99,80,82,75,85,80,95,89,86)
> English<-c(25,22,18,15,20,28,15,30,27,18)
> grade<-data.frame(Student,math,Science,English)
> grade
             Student math Science English
1         John Davis  502      95      25
2    Angela Williams  600      99      22
3   Bullwinkle Moose  412      80      18
4        David Jones  358      82      15
5  Janice Markhammer  495      75      20
6     Cheryl Cushing  512      85      28
7     Reuven Ytzrhak  410      80      15
8          Greg Knox  625      95      30
9      Joel Enghland  573      89      27
10     Marry Rayburn  522      86      18

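The problems to solve are as follows:
1. Give each student a single performance measure, which requires combining the scores across subjects.
2. Grade the top 20% of students as A, the next 20% as B, and so on.
3. Sort the students alphabetically by name.
Before solving these problems, we need to get familiar with R's numeric and character-processing functions.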

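Mathematical functions

abs(x) absolute value
sqrt(x) square root
ceiling(x) smallest integer not less than x
floor(x) largest integer not greater than x
trunc(x) integer part of x, truncated toward zero
round(x,digits=n) rounds x to n decimal places
signif(x,digits=n) rounds x to n significant digits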
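
A short sketch of how the rounding functions differ, especially on negative numbers. The vector x below is made up for illustration and is not part of the grade data.

```r
# Illustrative values (not from the grade data), chosen to show
# how the rounding functions treat negative numbers.
x <- c(-2.7, -0.5, 1.5, 2.345)

up    <- ceiling(x)   # -2  0  2  3
down  <- floor(x)     # -3 -1  1  2
whole <- trunc(x)     # -2  0  1  2  (drops the fractional part)

r1 <- round(3.14159, digits = 2)   # 3.14
s2 <- signif(123456, digits = 2)   # 120000

print(rbind(x, up, down, whole))
```

Note that floor() and trunc() agree for positive numbers but differ for negative ones.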

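Statistical functions

mean(x) mean
median(x) median
sd(x) standard deviation
var(x) variance
mad(x) median absolute deviation
quantile(x,probs) quantiles of x at the probabilities given in probs
range(x) range (minimum and maximum)
sum(x) sum
diff(x,lag=n) lagged differences
min(x) minimum
max(x) maximum
scale() centers each column and then standardizes it (by default to mean 0 and standard deviation 1)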
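
As a rough illustration, the following sketch applies a few of these functions to the math scores from the data frame above and checks that scale() matches a z-score computed by hand. The variable names (m, med, s, rng, z_manual, z_scale, d2) are only for this example.

```r
# Math scores from the data frame above.
math <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522)

m   <- mean(math)     # 500.9
med <- median(math)   # 507
s   <- sd(math)       # about 86.67
rng <- range(math)    # 358 625

# Standardize by hand: subtract the mean, divide by the standard deviation.
z_manual <- (math - m) / s

# scale() returns a one-column matrix; as.numeric() drops its attributes.
z_scale <- as.numeric(scale(math))

print(all.equal(z_manual, z_scale))

# Lagged differences: with lag = 2, each value is compared
# with the value two positions earlier.
d2 <- diff(c(1, 5, 23, 29), lag = 2)   # 22 24
```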

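Character-processing functions

nchar(x) counts the number of characters in x

sub(pattern,replacement,x,ignore.case=FALSE,fixed=FALSE) searches x for pattern and replaces the first match with the text replacement

strsplit(x,split,fixed=FALSE) splits the elements of the character vector x at split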
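
A minimal sketch of these three functions on two of the student names. The "banana" string is just an illustrative value, and gsub() is shown only to contrast with sub().

```r
students <- c("John Davis", "Angela Williams")

n_chars <- nchar(students)                          # 10 15
under   <- sub(" ", "_", students, fixed = TRUE)    # "John_Davis" "Angela_Williams"

# sub() only replaces the first match; gsub() replaces all of them.
first_only <- sub("a", "A", "banana")    # "bAnana"
all_a      <- gsub("a", "A", "banana")   # "bAnAnA"

# strsplit() returns a list with one character vector per input element.
parts <- strsplit(students, " ", fixed = TRUE)
print(parts)
```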

Now let's work through the problems listed above.
1. Combine each student's subject scores into a single performance measure

> grade
             Student math Science English
1         John Davis  502      95      25
2    Angela Williams  600      99      22
3   Bullwinkle Moose  412      80      18
4        David Jones  358      82      15
5  Janice Markhammer  495      75      20
6     Cheryl Cushing  512      85      28
7     Reuven Ytzrhak  410      80      15
8          Greg Knox  625      95      30
9      Joel Enghland  573      89      27
10     Marry Rayburn  522      86      18
> z<-scale(grade[,2:4]) # center and standardize each subject's scores so that they are comparable
> z
             math     Science     English
 [1,]  0.01269128  1.07806562  0.58685145
 [2,]  1.14336936  1.59143020  0.03667822
 [3,] -1.02568654 -0.84705156 -0.69688609
 [4,] -1.64871324 -0.59036927 -1.24705932
 [5,] -0.06807144 -1.48875728 -0.33010394
 [6,]  0.12806660 -0.20534583  1.13702468
 [7,] -1.04876160 -0.84705156 -1.24705932
 [8,]  1.43180765  1.07806562  1.50380683
 [9,]  0.83185601  0.30801875  0.95363360
[10,]  0.24344191 -0.07700469 -0.69688609
attr(,"scaled:center")
   math Science English 
  500.9    86.6    21.8 
attr(,"scaled:scale")
     math   Science   English 
86.673654  7.791734  5.452828 
> score<-apply(z,1,mean) # take the mean of each row of z
> score
 [1]  0.5592028  0.9238259 -0.8565414 -1.1620473 -0.6289776  0.3532485 -1.0476242
 [8]  1.3378934  0.6978361 -0.1768163
> grade<-cbind(grade,score)
> grade # the data frame with the composite score attached
             Student math Science English      score
1         John Davis  502      95      25  0.5592028
2    Angela Williams  600      99      22  0.9238259
3   Bullwinkle Moose  412      80      18 -0.8565414
4        David Jones  358      82      15 -1.1620473
5  Janice Markhammer  495      75      20 -0.6289776
6     Cheryl Cushing  512      85      28  0.3532485
7     Reuven Ytzrhak  410      80      15 -1.0476242
8          Greg Knox  625      95      30  1.3378934
9      Joel Enghland  573      89      27  0.6978361
10     Marry Rayburn  522      86      18 -0.1768163
> y<-quantile(grade$score,c(0.8,0.6,0.4,0.2)) # use quantile() to compute the score cutoff at each percentile
> y
       80%        60%        40%        20% 
 0.7430341  0.4356302 -0.3576808 -0.8947579 
# recode each student's percentile rank into a new categorical grade variable
> grade$level[grade$score>=y[1]]<-"A"
> grade$level[grade$score<y[1] & grade$score>=y[2]]<-"B"
> grade$level[grade$score<y[2] & grade$score>=y[3]]<-"C"
> grade$level[grade$score<y[3] & grade$score>=y[4]]<-"D"
> grade$level[grade$score<y[4]]<-"F"
> grade
             Student math Science English      score level
1         John Davis  502      95      25  0.5592028     B
2    Angela Williams  600      99      22  0.9238259     A
3   Bullwinkle Moose  412      80      18 -0.8565414     D
4        David Jones  358      82      15 -1.1620473     F
5  Janice Markhammer  495      75      20 -0.6289776     D
6     Cheryl Cushing  512      85      28  0.3532485     C
7     Reuven Ytzrhak  410      80      15 -1.0476242     F
8          Greg Knox  625      95      30  1.3378934     A
9      Joel Enghland  573      89      27  0.6978361     B
10     Marry Rayburn  522      86      18 -0.1768163     C
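
The five assignment statements above could also be written as a single cut() call. This is only a sketch of an alternative, not the approach used in this post. The names cuts and level are introduced here for illustration, and the scores are recomputed so that the block runs on its own.

```r
math    <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522)
Science <- c(95, 99, 80, 82, 75, 85, 80, 95, 89, 86)
English <- c(25, 22, 18, 15, 20, 28, 15, 30, 27, 18)

# Composite score: mean of the standardized subject scores, as above.
score <- rowMeans(scale(cbind(math, Science, English)))

# Cutoffs in increasing order, padded with -Inf and Inf.
cuts <- c(-Inf, quantile(score, c(0.2, 0.4, 0.6, 0.8)), Inf)

# right = FALSE makes each interval [lower, upper), matching the
# ">= cutoff" comparisons used in the step-by-step version.
level <- cut(score, breaks = cuts, labels = c("F", "D", "C", "B", "A"),
             right = FALSE)

print(data.frame(score = round(score, 3), level))
```

Unlike the step-by-step version, cut() returns a factor rather than a character vector.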
# use strsplit() to split each student's name into first and last name
> name<-strsplit(grade$Student," ")
Error in strsplit(grade$Student, " ") : non-character argument
# this fails because Student is not a character variable (before R 4.0, data.frame() converts strings to factors by default)
> is.character(grade$Student)
[1] FALSE
> class(grade$Student)
[1] "factor"
# it is a factor, so we convert it to character
> grade$Student<-as.character(grade$Student)
> name<-strsplit((grade$Student)," ")
> name
[[1]]
[1] "John"  "Davis"

[[2]]
[1] "Angela"   "Williams"

[[3]]
[1] "Bullwinkle" "Moose"     

[[4]]
[1] "David" "Jones"

[[5]]
[1] "Janice"     "Markhammer"

[[6]]
[1] "Cheryl"  "Cushing"

[[7]]
[1] "Reuven"  "Ytzrhak"

[[8]]
[1] "Greg" "Knox"

[[9]]
[1] "Joel"     "Enghland"

[[10]]
[1] "Marry"   "Rayburn"
# use sapply() to extract the first element of each list component as Firstname and the second as Lastname
> Firstname<-sapply(name,"[",1)
> Lastname<-sapply(name,"[",2)
> Firstname
 [1] "John"       "Angela"     "Bullwinkle" "David"      "Janice"     "Cheryl"    
 [7] "Reuven"     "Greg"       "Joel"       "Marry"     
> Lastname
 [1] "Davis"      "Williams"   "Moose"      "Jones"      "Markhammer" "Cushing"   
 [7] "Ytzrhak"    "Knox"       "Enghland"   "Rayburn"   
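
The "[" passed to sapply() above is the subsetting operator used as a function. A small sketch, assuming just two of the names, shows that it is equivalent to an explicit anonymous function. The vapply() variant is an optional, stricter alternative and is not used in this post.

```r
name <- strsplit(c("John Davis", "Angela Williams"), " ")

# "[" is itself a function: `[`(v, 1) is the same as v[1].
first_a <- sapply(name, "[", 1)
first_b <- sapply(name, function(parts) parts[1])

# vapply() declares the expected result type up front.
last <- vapply(name, function(parts) parts[2], character(1))

print(identical(first_a, first_b))
print(last)
```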
# drop the original name column and bind the split first and last names to the data frame
> grade<-grade[,-1]
> grade
   math Science English      score level
1   502      95      25  0.5592028     B
2   600      99      22  0.9238259     A
3   412      80      18 -0.8565414     D
4   358      82      15 -1.1620473     F
5   495      75      20 -0.6289776     D
6   512      85      28  0.3532485     C
7   410      80      15 -1.0476242     F
8   625      95      30  1.3378934     A
9   573      89      27  0.6978361     B
10  522      86      18 -0.1768163     C
> grade<-cbind(Firstname,Lastname,grade)
> grade
    Firstname   Lastname math Science English      score level
1        John      Davis  502      95      25  0.5592028     B
2      Angela   Williams  600      99      22  0.9238259     A
3  Bullwinkle      Moose  412      80      18 -0.8565414     D
4       David      Jones  358      82      15 -1.1620473     F
5      Janice Markhammer  495      75      20 -0.6289776     D
6      Cheryl    Cushing  512      85      28  0.3532485     C
7      Reuven    Ytzrhak  410      80      15 -1.0476242     F
8        Greg       Knox  625      95      30  1.3378934     A
9        Joel   Enghland  573      89      27  0.6978361     B
10      Marry    Rayburn  522      86      18 -0.1768163     C
# last step: sort by first name and then last name
> grade[order(Firstname,Lastname),]
    Firstname   Lastname math Science English      score level
2      Angela   Williams  600      99      22  0.9238259     A
3  Bullwinkle      Moose  412      80      18 -0.8565414     D
6      Cheryl    Cushing  512      85      28  0.3532485     C
4       David      Jones  358      82      15 -1.1620473     F
8        Greg       Knox  625      95      30  1.3378934     A
5      Janice Markhammer  495      75      20 -0.6289776     D
9        Joel   Enghland  573      89      27  0.6978361     B
1        John      Davis  502      95      25  0.5592028     B
10      Marry    Rayburn  522      86      18 -0.1768163     C
7      Reuven    Ytzrhak  410      80      15 -1.0476242     F
# more practically, we can also sort by score from high to low
> grade[order(-score),]
    Firstname   Lastname math Science English      score level
8        Greg       Knox  625      95      30  1.3378934     A
2      Angela   Williams  600      99      22  0.9238259     A
9        Joel   Enghland  573      89      27  0.6978361     B
1        John      Davis  502      95      25  0.5592028     B
6      Cheryl    Cushing  512      85      28  0.3532485     C
10      Marry    Rayburn  522      86      18 -0.1768163     C
5      Janice Markhammer  495      75      20 -0.6289776     D
3  Bullwinkle      Moose  412      80      18 -0.8565414     D
7      Reuven    Ytzrhak  410      80      15 -1.0476242     F
4       David      Jones  358      82      15 -1.1620473     F
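
order() does not sort the data itself. It returns the row positions in sorted order, which are then used to index the data frame. A small sketch with the rounded scores of the first four students (values shortened here for illustration):

```r
# Rounded scores of the first four students, for illustration only.
score   <- c(0.56, 0.92, -0.86, -1.16)
student <- c("John", "Angela", "Bullwinkle", "David")

asc   <- order(score)                      # 4 3 1 2
desc  <- order(-score)                     # 2 1 3 4
desc2 <- order(score, decreasing = TRUE)   # same positions as desc here

by_name <- order(student)                  # 2 3 4 1

print(student[desc])
```

Negating the key works only for numeric columns. For character columns, decreasing = TRUE is the usual route.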
# we can also export the result as a CSV file, which Excel can open
> grade<-grade[order(-score),]
> write.csv(grade,file = "grade.csv")
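
By default write.csv() also writes the row names as an unnamed first column. A minimal sketch, assuming a small stand-in data frame df and a temporary file path (both hypothetical), shows how row.names = FALSE leaves them out and how the file can be read back:

```r
# Small stand-in data frame; not the full grade table.
df <- data.frame(Firstname = c("Greg", "Angela"),
                 score = c(1.34, 0.92),
                 stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")   # hypothetical location for this example

# row.names = FALSE keeps the row labels out of the file.
write.csv(df, file = path, row.names = FALSE)

back <- read.csv(path, stringsAsFactors = FALSE)
print(back)
```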


Task complete.
