如何用R語言參破每個復仇者的口頭禪？

景略集智發表於2019-03-02

原文網址 : https://flycode.co/archives/288761

盼望著，盼望著，《復聯3》終於在國內上映。《復仇者聯盟：無限戰爭》的表現也不負眾望，國內上映3天后票房即達12億元，目前豆瓣評分為8.5。

不用說你也知道，“復仇者聯盟”裡每個成員都性格迥異，所以說話用詞都有各自鮮明的特點。那他們說話都愛用哪些詞兒？

國外有幾位漫威的鐵桿粉絲把每個復仇者的說話習慣用 R 語言視覺化了出來，圖中每個詞對應的條形長度，代表了他比其他復仇者更愛說這個詞的程度。

我們可以看到，美隊老愛喊別人名字，特別是託尼（emmmmmm...）；黑豹經常唸叨一些很高大上的詞（比如朋友，國王），不像蜘蛛俠，滿嘴嗯啊個不停（比如嘿，啊，呃），還跟個孩子似的；浩克和鷹眼說的最多的是黑寡婦，不過兩人喊得稱呼卻不同（原因你猜）；幻視和緋紅女巫很有共同話題，所以這是倆人互生愛慕的原因？果然，雷神唸叨最多的還是老弟洛基，而且老是想著“宇宙大事”，說的話都和第三部《無限戰爭》緊密相關；至於洛基嘛，意料之中的經常嗶嗶“權力”“王位”這些，但是跟洛基一樣也渴望權力的奧創卻說話不一樣，人家說的詞就很有詩意。

這麼有意思的視覺化圖形是怎麼做出來的呢？祕笈如下：

首先我們會用到以下 R 語言包：

library(dplyr)
library(grid)
library(gridExtra)
library(ggplot2)
library(reshape2)
library(cowplot)
library(jpeg)
library(extrafont)
複製程式碼

有些人可能認為使用“清除所有”程式碼行很不好，但是在指令碼頂部用它可以確保在執行指令碼時，指令碼不會依賴不小心遺留在工作區內的任何物件。

rm(list = ls())
複製程式碼

這是包含所有復仇者影象的資料夾：

dir_images <- "C:\\Users\\Matt\\Documents\\R\\Avengers"
setwd(dir_images)
複製程式碼

設定字型

windowsFonts(Franklin=windowsFont("Franklin Gothic Demi"))
複製程式碼

各個復仇者名字的簡化版

character_names <- c("black_panther","black_widow","bucky","captain_america",
                    "falcon","hawkeye","hulk","iron_man",
                    "loki","nick_fury","rhodey","scarlet_witch",
                    "spiderman","thor","ultron","vision")
image_filenames <- paste0(character_names, ".jpg")
複製程式碼

讀取和簡化版復仇者名字對應的影象檔案的函式

read_image <- function(filename){
 char_name <- gsub(pattern = "\\.jpg$", "", filename)
 img <- jpeg::readJPEG(filename)
 return(img)
}
複製程式碼

將所有影象讀取為一個列表

all_images <- lapply(image_filenames, read_image)
複製程式碼

為這列影象分配名字，這樣後面就可以被字元檢索到了

names(all_images) <- character_names
複製程式碼

其實使用影象名字很簡單，比如下面這個例子

# clear the plot window
grid.newpage()
# draw to the plot window
grid.draw(rasterGrob(all_images[['vision']]))
複製程式碼

獲取文字資料這幾位漫威粉並沒有將他們自己的電影臺詞資料集分享出來，不過我們可以在 IMSDB 上下載，然後用文字分析技術稍作處理。如果原作者後面將自己的資料集公開，我們會第一時間分享。

載入本地資料集。

修正人物名字的大小寫

capitalize <- Vectorize(function(string){
 substr(string,1,1) <- toupper(substr(string,1,1))
 return(string)
})

proper_noun_list <- c("clint","hydra","steve","tony",
                     "sam","stark","strucker","nat","natasha",
                     "hulk","tesseract", "vision",
                     "loki","avengers","rogers", "cap", "hill")

# Run the capitalization function
word_data <- word_data %>%
 mutate(word = ifelse(word %in% proper_noun_list, capitalize(word), word)) %>%
 mutate(word = ifelse(word == "jarvis", "JARVIS", word))
複製程式碼

注意前面的簡化版人物名字，不要匹配文字資料框中已經處理好格式的人物名字。

unique(word_data$Speaker)
##  [1] "Black Panther"   "Black Widow"     "Bucky"
##  [4] "Captain America" "Falcon"          "Hawkeye"
##  [7] "Hulk"            "Iron Man"        "Loki"
## [10] "Nick Fury"       "Rhodey"          "Scarlet Witch"
## [13] "Spiderman"       "Thor"            "Ultron"
## [16] "Vision"
複製程式碼

製作一個查詢表，將簡寫的檔名轉換為美觀的人物名字

character_labeler <- c(`black_panther` = "Black Panther",
                      `black_widow` = "Black Widow",
                      `bucky` = "Bucky",
                      `captain_america` = "Captain America",
                      `falcon` = "Falcon", `hawkeye` = "Hawkeye",
                      `hulk` = "Hulk", `iron_man` = "Iron Man",
                      `loki` = "Loki", `nick_fury` = "Nick Fury",
                      `rhodey` = "Rhodey",`scarlet_witch` ="Scarlet Witch",
                      `spiderman`="Spiderman", `thor`="Thor",
                      `ultron` ="Ultron", `vision` ="Vision")
複製程式碼

獲得兩個不同版本的人物名字

其中一個版本用來展示（因為美觀），另一個版本用於簡單的組織和引用影象檔案（因為簡單）。

convert_pretty_to_simple <- Vectorize(function(pretty_name){
 # pretty_name = "Vision"
 simple_name <- names(character_labeler)[character_labeler==pretty_name]
 # simple_name <- as.vector(simple_name)
 return(simple_name)
})
# convert_pretty_to_simple(c("Vision","Thor"))
# just for fun, the inverse of that function
convert_simple_to_pretty <- function(simple_name){
 # simple_name = "vision"
 pretty_name <- character_labeler[simple_name] %>% as.vector()
 return(pretty_name)
}
# example
convert_simple_to_pretty(c("vision","black_panther"))
## [1] "Vision"        "Black Panther"

複製程式碼

為文字資料框新增簡化版人物名字。

word_data$character <- convert_pretty_to_simple(word_data$Speaker)
複製程式碼

為每個人物分配一個主要顏色。

character_palette <- c(`black_panther` = "#51473E",
                      `black_widow` = "#89B9CD",
                      `bucky` = "#6F7279",
                      `captain_america` = "#475D6A",
                      `falcon` = "#863C43", `hawkeye` = "#84707F",
                      `hulk` = "#5F5F3F", `iron_man` = "#9C2728",
                      `loki` = "#3D5C25", `nick_fury` = "#838E86",
                      `rhodey` = "#38454E",`scarlet_witch` ="#620E1B",
                      `spiderman`="#A23A37", `thor`="#323D41",
                      `ultron` ="#64727D", `vision` ="#81414F" )
複製程式碼

製作水平方向的條形圖

avengers_bar_plot <- word_data %>%
 group_by(Speaker) %>%
 top_n(5, amount) %>%
 ungroup() %>%
 mutate(word = reorder(word, amount)) %>%
 ggplot(aes(x = word, y = amount, fill = character))+
 geom_bar(stat = "identity", show.legend = FALSE)+
 scale_fill_manual(values = character_palette)+
 scale_y_continuous(name ="Log Odds of Word",
                    breaks = c(0,1,2)) +
 theme(text = element_text(family = "Franklin"),
       # axis.title.x = element_text(size = rel(1.5)),
       panel.grid = element_line(colour = NULL),
       panel.grid.major.y = element_blank(),
       panel.grid.minor = element_blank(),
       panel.background = element_rect(fill = "white",
                                   colour = "white"))+
 # theme(strip.text.x = element_text(size = rel(1.5)))+
 xlab("")+
 coord_flip()+
 facet_wrap(~Speaker, scales = "free_y")
avengers_bar_plot
複製程式碼

看起來很不錯。

但是我們想畫個更酷炫的圖：用每個復仇者的照片來填充條形圖。

也就是說我們只在條形圖區域內展示出復仇者的照片，在條形區域以外的地方則不展示（如下圖所示）。

如果想做到這點，我們需要顯示一個透明的條形，然後在條形的末尾畫一個白色的條形，延伸至影象邊緣覆蓋人物照片的剩餘部分。

在資料框部分，我們現在想用所需的值的餘數來補充數字值，以實現整體最大化，這樣當把值和餘數相加時，所有數值都會增加到同一最大數值，以同樣的格式將不同行組合到一起。

max_amount <- max(word_data$amount)
word_data$remainder <- (max_amount - word_data$amount) + 0.2
複製程式碼

只提取每個復仇者說的最多的5個詞

word_data_top5 <- word_data %>%
 group_by(character) %>%
 arrange(desc(amount)) %>%
 slice(1:5) %>%
 ungroup()
複製程式碼

將數量&餘數轉換為長格式

這樣能保證每個人物和所說詞語的匹配關係有兩個 entry，一個用以真實數量（“amount”），一個用以選擇在哪裡結束，達到常見的最大值（“remainder”）。

這會將“amount”和“remainder”重疊為一個單獨的列稱為“variable”，表示是什麼值，而另一個列“value”包含來自這些值中每一個值的數字。

word_data_top5_m <- melt(word_data_top5, measure.vars = c("amount","remainder"))
複製程式碼

Variable 是一個值是真實數量還是補充數量的標記。

現在我們按順序將它們放在一起，和在melt函式中的確定它們的順序相反。否則“amount”和“remainder”會以相反的順序展現在圖形中。

word_data_top5_m$variable2 <- factor(word_data_top5_m$variable,
                                    levels = rev(levels(word_data_top5_m$variable)))
複製程式碼

為一個人物展示前 5 個詞語資料的函式

以簡單的形式宣告人物名字，比如用 black_panther 而不是 Black Panther。

plot_char <- function(character_name){
 # example: character_name = "black_panther"
 # plot details that we might want to fiddle with
 # thickness of lines between bars
 bar_outline_size <- 0.5
 # transparency of lines between bars
 bar_outline_alpha <- 0.25
 #
 # The function takes the simple character name,
 # but here, we convert it to the pretty name,
 # because we'll want to use that on the plot.
 pretty_character_name <- convert_simple_to_pretty(character_name)

 # Get the image for this character,
 # from the list of all images.
 temp_image <- all_images[character_name]

 # Make a data frame for only this character
 temp_data <- word_data_top5_m %>%
   dplyr::filter(character == character_name) %>%
   mutate(character = character_name)

 # order the words by frequency
 # First, make an ordered vector of the most common words
 # for this character
   ordered_words <- temp_data %>%
     mutate(word = as.character(word)) %>%
     dplyr::filter(variable == "amount") %>%
     arrange(value) %>%
     `[[`(., "word")

   # order the words in a factor,
   # so that they plot in this order,
   # rather than alphabetical order
   temp_data$word = factor(temp_data$word, levels = ordered_words)

 # Get the max value,
 # so that the image scales out to the end of the longest bar
 max_value <- max(temp_data$value)
 fill_colors <- c(`remainder` = "white", `value` = "white")

 # Make a grid object out of the character's image
 character_image <- rasterGrob(all_images[[character_name]],
                               width = unit(1,"npc"),
                               height = unit(1,"npc"))

 # make the plot for this character
 output_plot <- ggplot(temp_data)+
   aes(x = word, y = value, fill = variable2)+
   # add image
   # draw it completely bottom to top (x),
   # and completely from left to the the maximum log-odds value (y)
   # note that x and y are flipped here,
   # in prep for the coord_flip()
   annotation_custom(character_image,
                     xmin = -Inf, xmax = Inf, ymin = 0, ymax = max_value) +
   geom_bar(stat = "identity", color = alpha("white", bar_outline_alpha),
            size = bar_outline_size, width = 1)+
   scale_fill_manual(values = fill_colors)+
   theme_classic()+
   coord_flip(expand = FALSE)+
   # use a facet strip,
   # to serve as a title, but with color
   facet_grid(. ~ character, labeller = labeller(character = character_labeler))+
   # figure out color swatch for the facet strip fill
   # using character name to index the color palette
   # color= NA means there's no outline color.
   theme(strip.background = element_rect(fill = character_palette[character_name],
                                         color = NA))+
   # other theme elements
   theme(strip.text.x = element_text(size = rel(1.15), color = "white"),
         text = element_text(family = "Franklin"),
         legend.position = "none",
         panel.grid = element_blank(),
         axis.text.x = element_text(size = rel(0.8)))+
   # omit the axis title for the individual plot,
   # because we'll have one for the entire ensemble
   theme(axis.title = element_blank())
 return(output_plot)
}
複製程式碼

將 X 軸名稱用為所有復仇者主影象的名稱

plot_x_axis_text <- paste("Tendency to use this word more than other characters do",
                         "(units of log odds ratio)", sep = "\n")
複製程式碼

下面是函式在這裡的工作示例

sample_plot <- plot_char("black_panther")+
 theme(axis.title = element_text())+
 # x lab is still declared as y lab
 # because of coord_flip()
 ylab(plot_x_axis_text)
sample_plot
複製程式碼

為何我們這裡的水平軸上還帶著非常奇怪的“對數差異比”？

因為隨著數字增大，差異也會隨之增大（具體數學知識這裡不再講述）；將它們轉換為對數尺度，可以約束變化幅度的大小，方便我們在螢幕上展示。

如果想將這些對數差異轉化為簡單的概率形式，可以用如下函式：

logit2prob <- function(logit){
 odds <- exp(logit)
 prob <- odds / (1 + odds)
 return(prob)
}
複製程式碼

這樣處理後水平軸會如下所示：

logit2prob(seq(0, 2.5, 0.5))
## [1] 0.5000000 0.6224593 0.7310586 0.8175745 0.8807971 0.9241418
複製程式碼

注意此序列中連續專案之間的差異在慢慢消失：

diff(logit2prob(seq(0, 2.5, 0.5)))
## [1] 0.12245933 0.10859925 0.08651590 0.06322260 0.04334474
複製程式碼

Okay，現在我們製作出了一個圖···

我們接著將函式應用到列表中所有復仇者身上，將所有繪圖放入一個列表物件。

all_plots <- lapply(character_names, plot_char)
複製程式碼

從繪圖中提取軸名稱的函式

不僅僅是文字，還有其它畫出的資訊。

你可以選擇提取 X 軸名稱還是 Y 軸名稱：

get_axis_grob <- function(plot_to_pick, which_axis){
 # plot_to_pick <- sample_plot
 tmp <- ggplot_gtable(ggplot_build(plot_to_pick))
 # tmp$grobs
 # find the grob that looks like
 # it would be the x axis
 axis_x_index <- which(sapply(tmp$grobs, function(x){
   # for all the grobs,
   # return the index of the one
   # where you can find the text
   # "axis.title.x" or "axis.title.y"
   # based on input argument `which_axis`
   grepl(paste0("axis.title.",which_axis), x)}
 ))
 axis_grob <- tmp$grobs[[axis_x_index]]
 return(axis_grob)
} 
複製程式碼

提取軸名稱 Grob

px_axis_x <- get_axis_grob(sample_plot, "x")
px_axis_y <- get_axis_grob(sample_plot, "y")
複製程式碼

下面是使用這些提取的軸的方法：

grid.newpage()
grid.draw(px_axis_x) 
複製程式碼

將所有繪圖排成一個物件

big_plot <- arrangeGrob(grobs = all_plots)
複製程式碼

將 X 軸嵌入繪圖的底部，因為每個圖並沒有 X 軸，而我們想讓它們都有 X 軸。

注意這時繪圖會看著很不協調，高度差不多是寬度的十倍。

big_plot_w_x_axis_title <- arrangeGrob(big_plot,
                                      px_axis_x,
                                      heights = c(10,1))
grid.newpage()
grid.draw(big_plot_w_x_axis_title)
複製程式碼

繪圖所佔的空間大小不一，因為每個圖的詞彙長度不同。

這樣看起來有些混亂。

通常我們會用 facet_grid() 或 facet_wrap() 來確保繪圖整潔有序，但這裡卻不能使用因為每個圖的背景圖各不相同，無法像資料框中的其它列一樣對映到平面上（因為背景影象實際上並非資料框的一部分）。

使用 cowplot 而非 arrangeGrob

這樣繪圖的軸會垂直對齊：

big_plot_aligned <- cowplot::plot_grid(plotlist = all_plots, align = 'v', nrow = 4)
複製程式碼

和之前一樣，將X軸名稱新增至繪圖對齊後網格的下方。

big_plot_w_x_axis_title_aligned <- arrangeGrob(big_plot_aligned,
                                              px_axis_x,
                                              heights = c(10,1))
複製程式碼

下面是將整體效果圖繪製在螢幕上的方法：

grid.newpage()
grid.draw(big_plot_w_x_axis_title_aligned)
複製程式碼

很好！

儲存最終影象：

ggsave(big_plot_w_x_axis_title_aligned,
      file = "Avengers_Word_Usage.png",
      width = 12, height = 6.3)
複製程式碼

這樣，我們就視覺化出了《復聯》中各個復仇者都最愛說那些話！

如何用python語言發出乘法口訣表
2021-09-11
Python
r語言
2019-10-18
R語言
【R語言入門】R語言環境搭建
2021-09-09
R語言
R 語言使用
2024-06-10
本週11個大語言模型工具如MindGraph
2024-03-24
模型
R語言的入門教程
2023-12-21
R語言
也談SpringCloud：假如《復仇者聯盟4》是一個微服務
2019-04-10
SpringGCCloud微服務
這位《復仇者聯盟》系列的幕後英雄，也曾參與過遊戲大作的開發
2019-04-26
遊戲
R語言的初級學習
2024-04-27
R語言
python和r語言的區別
2021-09-11
PythonR語言
C語言形參和實參
2024-10-01
C語言
R語言快速入門
2018-09-18
R語言
R語言：畫樹圖
2018-09-22
R語言
什麼是r語言
2019-10-16
R語言
R語言函式-tolower
2020-11-17
R語言函式
《復仇者聯盟4》觀感：不推薦
2019-04-28
社交網路分析的 R 基礎：（一）初探 R 語言
2022-02-05
遊戲基礎知識——“復仇者”角色的設計手法
2019-12-06
遊戲
R語言-Survival analysis（生存分析）
2019-05-31
R語言
【R語言入門】R語言中的變數與基本資料型別
2020-11-28
R語言變數資料型別
R語言有多強大？十個你不知道的功能
2019-03-29
R語言
硬核觀察 #795 首個可以翻譯無文字語言的口語翻譯 AI
2022-10-20
AI
python和R語言哪個好?哪個適合資料分析？
2020-09-29
PythonR語言
每個程式設計師都應該知道的下一個程式語言——Kotlin
2018-11-09
程式設計師Kotlin
R語言安裝多個版本和多版本Rstudio管理
2024-11-01
R語言
《復仇者聯盟4》預告片24小時瀏覽量達2.89億次破影史記錄
2018-12-09
【ChatGPT】每個程式設計師百寶箱必備的語言模型
2023-03-10
ChatGPT程式設計師模型
[譯] 如果把程式語言看作國家，每種語言分別代表哪個國家？
2018-12-08
使用R語言分析微信好友
2018-10-05
R語言
R語言批量建立資料框
2018-09-28
R語言
Lasso迴歸及其R語言操作
2024-04-08
R語言
R語言資料質量分析
2024-03-21
R語言
Python真的勒死R語言了嗎？
2019-08-25
PythonR語言
R語言經典統計分析
2024-10-16
R語言
如何用 4 行 R 語句，快速探索你的資料集？
2018-07-23
如何用最短的時間學會C語言，並掌握C語言的精髓所在？
2021-01-29
C語言
如何用vue-router為每個路由配置各自的title
2018-07-30
Vue路由
如何用python輸出每10個一行的元素
2020-11-05
Python