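gdb Debugging for Beginners: A High-Quality Guide from an Expert

Posted by 逆旅 on 2016-11-23

I did not expect an expert like Brendan Gregg to write a gdb tutorial like this one: gdb Debugging Full Example (Tutorial): ncurses. But, as the article says at the start, he was not satisfied with the gdb articles available online, and that is probably why we now have this high-quality guide, a real gift for gdb beginners. (何登成)

If you are a system administrator and have not heard of Brendan Gregg, you have probably still seen his three widely circulated diagrams of Linux performance tools. (伯小樂)

(Brendan Gregg)

gdb Debugging Full Example: ncurses

I'm a little frustrated with finding "gdb examples" online that show the commands but not their output. gdb is the GNU Debugger, the standard debugger on Linux. I was reminded of this lack of example output while watching Greg Law's CppCon 2015 talk "Give me 15 minutes and I'll change your view of GDB", which, thankfully, includes output! It is well worth the 15 minutes.

It also inspired me to share a full gdb debugging example, with output and every step involved, including the dead ends. This is not a particularly interesting or exotic issue, just a routine gdb debugging session. But it covers the basics and could serve as a tutorial of sorts, bearing in mind that there is a lot more to gdb than I used here.

I'll be running the following commands as root, since I'm debugging a tool that needs root access (for now). Use sudo to get root privileges where needed. You also aren't expected to read through all of this: I've enumerated each step so you can browse them and pick the ones of interest.

1. The Problem

The bcc collection of BPF tools had a pull request for cachetop.py, which uses a top-like display to show page cache statistics by process. Great! However, when I tested it, it hit a segfault: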

# ./cachetop.py
Segmentation fault

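Note that it says "Segmentation fault" and not "Segmentation fault (core dumped)". I'd like a core dump to debug this. (A core dump is a copy of process memory that can be investigated with a debugger; the name comes from the era of magnetic core memory.)

Core dump analysis is one approach for debugging, but not the only one. I could run the program live in gdb to inspect the issue. I could also use an external tracer to grab data and stack traces when the segfault happens. We'll start with core dumps.

2. Fixing Core Dumps

I'll check the core dump settings: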

# ulimit -c
0
# cat /proc/sys/kernel/core_pattern
core

ulimit -c shows the maximum size of core dumps that may be created, and here it is zero: core dumps are disabled (for this process and its children).
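The same limit can also be read from inside a process. The following is a minimal Python sketch, not part of the original session: it uses the standard resource module, and the helper names core_dump_limit and describe are made up for illustration. Note that getrlimit reports bytes, while the shell builtin may report blocks.

```python
import resource


def core_dump_limit():
    """Return the (soft, hard) core file size limits of this process."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def describe(limit):
    """Render one limit value roughly the way `ulimit -c` would explain it."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    if limit == 0:
        return "0 (core dumps disabled)"
    return "%d bytes" % limit


if __name__ == "__main__":
    soft, hard = core_dump_limit()
    print("soft limit:", describe(soft))
    print("hard limit:", describe(hard))
```

A soft limit of 0 here corresponds to the "Segmentation fault" message without "(core dumped)".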

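/proc/.../core_pattern is set to just "core", which drops a core dump file named "core" in the current directory. That would be fine for now, but I'll show how to set it up for a global location: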

# ulimit -c unlimited
# mkdir /var/cores
# echo "/var/cores/core.%e.%p" > /proc/sys/kernel/core_pattern

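You can customize core_pattern further; for example, %h is the hostname and %t is the time of the dump. The options are documented in the Linux kernel source, under Documentation/sysctl/kernel.txt.

To make the core_pattern setting permanent, so that it survives reboots, set "kernel.core_pattern" in /etc/sysctl.conf.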
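To see how such a pattern turns into a file name, here is a rough Python sketch. It only imitates the substitution for the specifiers used in this article (%e, %p, %h, %t) plus %%, and drops anything else; the function expand_core_pattern and its parameters are hypothetical, and the kernel's real implementation supports more specifiers.

```python
import socket
import time


def expand_core_pattern(pattern, exe, pid, host=None, timestamp=None):
    """Expand a small subset of core_pattern specifiers into a file name."""
    values = {
        "e": exe,                                   # executable name
        "p": str(pid),                              # process ID
        "h": socket.gethostname() if host is None else host,
        "t": str(int(time.time() if timestamp is None else timestamp)),
        "%": "%",                                   # literal percent sign
    }
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "%":
            # Look up the character after '%'; specifiers this sketch does
            # not know (or a trailing '%') are dropped.
            spec = pattern[i + 1:i + 2]
            out.append(values.get(spec, ""))
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(expand_core_pattern("/var/cores/core.%e.%p", "python", 30520))
```

With the pattern set above, a crashing python process with PID 30520 yields /var/cores/core.python.30520.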

Trying again:

# ./cachetop.py
Segmentation fault (core dumped)
# ls -lh /var/cores
total 19M
-rw------- 1 root root 20M Aug  7 22:15 core.python.30520
# file /var/cores/core.python.30520 
/var/cores/core.python.30520: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'python ./cachetop.py'

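That's better: we have our core dump.

3. Starting GDB

Now I'll run gdb with the target program location and the core dump file. I'm using shell substitution ("`") to find the program; if you are not sure that resolves to the right binary, specify the full path instead: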

# gdb `which python` /var/cores/core.python.30520
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python...(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New LWP 30520]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
warning: JITed object file architecture unknown is not compatible with target architecture i386:x86-64.
Core was generated by `python ./cachetop.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f0a37aac40d in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5

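The last two lines are especially interesting: they tell us this is a segmentation fault in the doupdate() function of the libncursesw library. That is worth a quick web search in case it is a well-known issue. I took a quick look but did not find a single common cause.

I can already guess what libncursesw is for, but if it were foreign to you, the fact that it lives under "/lib" and ends in ".so.*" shows that it is a shared library, which might have a man page, a website, a package description, and so on.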

# dpkg -l | grep libncursesw
ii  libncursesw5:amd64                  6.0+20160213-1ubuntu1                    amd64
     shared libraries for terminal handling (wide character support)

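I happen to be debugging this on Ubuntu, but the Linux distribution should not matter for gdb usage.

4. Back Trace

Stack back traces show how we arrived at the point of failure, and are often enough to identify a common problem. bt (short for backtrace) is usually the first command I use in a gdb session: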

(gdb) bt
#0  0x00007f0a37aac40d in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
#1  0x00007f0a37aa07e6 in wrefresh () from /lib/x86_64-linux-gnu/libncursesw.so.5
#2  0x00007f0a37a99616 in ?? () from /lib/x86_64-linux-gnu/libncursesw.so.5
#3  0x00007f0a37a9a325 in wgetch () from /lib/x86_64-linux-gnu/libncursesw.so.5
#4  0x00007f0a37cc6ec3 in ?? () from /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
#5  0x00000000004c4d5a in PyEval_EvalFrameEx ()
#6  0x00000000004c2e05 in PyEval_EvalCodeEx ()
#7  0x00000000004def08 in ?? ()
#8  0x00000000004b1153 in PyObject_Call ()
#9  0x00000000004c73ec in PyEval_EvalFrameEx ()
#10 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#11 0x00000000004caf42 in PyEval_EvalFrameEx ()
#12 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#13 0x00000000004c2ba9 in PyEval_EvalCode ()
#14 0x00000000004f20ef in ?? ()
#15 0x00000000004eca72 in PyRun_FileExFlags ()
#16 0x00000000004eb1f1 in PyRun_SimpleFileExFlags ()
#17 0x000000000049e18a in Py_Main ()
#18 0x00007f0a3be10830 in __libc_start_main (main=0x49daf0 <main>, argc=2, argv=0x7ffd33d94838, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7ffd33d94828) at ../csu/libc-start.c:291
#19 0x000000000049da19 in _start ()

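Read from the bottom up to go from parent to child. The "??" entries are where symbol translation failed. Stack walking, which produces the stack trace, can also fail. In that case you will likely see a single valid frame followed by a few bogus addresses. If symbols or stacks are too badly broken to make sense of the stack trace, there are usually ways to fix it: install debug info packages (giving gdb more symbols and letting it do DWARF-based stack walks), or recompile the software from source with frame pointers and debugging information (-fno-omit-frame-pointer -g). Many of the "??" entries above can be fixed by installing the python-dbg package.

This particular stack does not look very helpful: frames 5 to 17 (indexed on the left) are Python internals, although we cannot see the Python methods yet. Frame 4 is the _curses library, and then we are in libncursesw. The call sequence looks like wgetch()->wrefresh()->doupdate(). Just based on the names, I'd guess this is a window refresh. Why would that core dump?

5. Disassembly

I'll start by disassembling the function we segfaulted in, doupdate():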

(gdb) disas doupdate
Dump of assembler code for function doupdate:
   0x00007f0a37aac2e0 <+0>:   push   %r15
   0x00007f0a37aac2e2 <+2>:   push   %r14
   0x00007f0a37aac2e4 <+4>:   push   %r13
   0x00007f0a37aac2e6 <+6>:   push   %r12
   0x00007f0a37aac2e8 <+8>:   push   %rbp
   0x00007f0a37aac2e9 <+9>:   push   %rbx
   0x00007f0a37aac2ea <+10>:  sub    $0xc8,%rsp
[...]
---Type <return> to continue, or q <return> to quit---
[...]
   0x00007f0a37aac3f7 <+279>: cmpb   $0x0,0x21(%rcx)
   0x00007f0a37aac3fb <+283>: je     0x7f0a37aacc3b <doupdate+2395>
   0x00007f0a37aac401 <+289>: mov    0x20cb68(%rip),%rax        # 0x7f0a37cb8f70
   0x00007f0a37aac408 <+296>: mov    (%rax),%rsi
   0x00007f0a37aac40b <+299>: xor    %eax,%eax
=> 0x00007f0a37aac40d <+301>: mov    0x10(%rsi),%rdi
   0x00007f0a37aac411 <+305>: cmpb   $0x0,0x1c(%rdi)
   0x00007f0a37aac415 <+309>: jne    0x7f0a37aac6f7 <doupdate+1047>
   0x00007f0a37aac41b <+315>: movswl 0x4(%rcx),%ecx
   0x00007f0a37aac41f <+319>: movswl 0x74(%rdx),%edi
   0x00007f0a37aac423 <+323>: mov    %rax,0x40(%rsp)
[...]

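Output truncated. (I could also have typed just "disas" and it would have defaulted to doupdate.)

The "=>" points to our segfault address, which is executing mov 0x10(%rsi),%rdi: a move from the memory at the address held in %rsi plus an offset of 0x10, into the %rdi register. I'll check the state of the registers next.

6. Check Registers

Printing register state using i r (short for info registers):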

(gdb) i r
rax            0x0  0
rbx            0x1993060    26816608
rcx            0x19902a0    26804896
rdx            0x19ce7d0    27060176
rsi            0x0  0
rdi            0x19ce7d0    27060176
rbp            0x7f0a3848eb10   0x7f0a3848eb10 <SP>
rsp            0x7ffd33d93c00   0x7ffd33d93c00
r8             0x7f0a37cb93e0   139681862489056
r9             0x0  0
r10            0x8  8
r11            0x202    514
r12            0x0  0
r13            0x0  0
r14            0x7f0a3848eb10   139681870703376
r15            0x19ce7d0    27060176
rip            0x7f0a37aac40d   0x7f0a37aac40d <doupdate+301>
eflags         0x10246  [ PF ZF IF RF ]
cs             0x33 51
ss             0x2b 43
ds             0x0  0
es             0x0  0
fs             0x0  0
gs             0x0  0

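Well, %rsi is zero. There's our problem! Zero is unlikely to be a valid address, and this type of segfault is a common software bug: dereferencing an uninitialized or NULL pointer.

7. Memory Mappings

You can double-check whether zero is a valid address using i proc m (short for info proc mappings):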

(gdb) i proc m
Mapped address spaces:

      Start Addr           End Addr       Size     Offset objfile
        0x400000           0x6e7000   0x2e7000        0x0 /usr/bin/python2.7
        0x8e6000           0x8e8000     0x2000   0x2e6000 /usr/bin/python2.7
        0x8e8000           0x95f000    0x77000   0x2e8000 /usr/bin/python2.7
  0x7f0a37a8b000     0x7f0a37ab8000    0x2d000        0x0 /lib/x86_64-linux-gnu/libncursesw.so.5.9
  0x7f0a37ab8000     0x7f0a37cb8000   0x200000    0x2d000 /lib/x86_64-linux-gnu/libncursesw.so.5.9
  0x7f0a37cb8000     0x7f0a37cb9000     0x1000    0x2d000 /lib/x86_64-linux-gnu/libncursesw.so.5.9
  0x7f0a37cb9000     0x7f0a37cba000     0x1000    0x2e000 /lib/x86_64-linux-gnu/libncursesw.so.5.9
  0x7f0a37cba000     0x7f0a37ccd000    0x13000        0x0 /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
  0x7f0a37ccd000     0x7f0a37ecc000   0x1ff000    0x13000 /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
  0x7f0a37ecc000     0x7f0a37ecd000     0x1000    0x12000 /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
  0x7f0a37ecd000     0x7f0a37ecf000     0x2000    0x13000 /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
  0x7f0a38050000     0x7f0a38066000    0x16000        0x0 /lib/x86_64-linux-gnu/libgcc_s.so.1
  0x7f0a38066000     0x7f0a38265000   0x1ff000    0x16000 /lib/x86_64-linux-gnu/libgcc_s.so.1
  0x7f0a38265000     0x7f0a38266000     0x1000    0x15000 /lib/x86_64-linux-gnu/libgcc_s.so.1
  0x7f0a38266000     0x7f0a3828b000    0x25000        0x0 /lib/x86_64-linux-gnu/libtinfo.so.5.9
  0x7f0a3828b000     0x7f0a3848a000   0x1ff000    0x25000 /lib/x86_64-linux-gnu/libtinfo.so.5.9
[...]

The first valid virtual address is 0x400000. Anything below that is invalid and, if referenced, will trigger a segmentation fault.
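The check being done by eye here can be sketched in a few lines of Python. This is only an illustration: the MAPPINGS list copies a handful of rows from the output above, and find_mapping is a hypothetical helper, not a gdb feature.

```python
# (start, end, objfile) rows copied from the "i proc m" output above.
MAPPINGS = [
    (0x400000, 0x6e7000, "/usr/bin/python2.7"),
    (0x8e6000, 0x8e8000, "/usr/bin/python2.7"),
    (0x8e8000, 0x95f000, "/usr/bin/python2.7"),
    (0x7f0a37a8b000, 0x7f0a37ab8000, "/lib/x86_64-linux-gnu/libncursesw.so.5.9"),
]


def find_mapping(addr, mappings=MAPPINGS):
    """Return the objfile whose [start, end) range contains addr, or None."""
    for start, end, objfile in mappings:
        if start <= addr < end:
            return objfile
    return None


if __name__ == "__main__":
    rsi = 0x0                                # %rsi from the register dump
    print(find_mapping(rsi + 0x10))          # address the faulting mov reads
    print(find_mapping(0x7f0a37aac40d))      # address of the faulting instruction
```

The address the mov tries to read, 0x10, is in no mapping, while the instruction address itself falls inside libncursesw.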

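At this point there are several different ways to dig further. I'll start with some instruction stepping.

8. Breakpoints

Back to the disassembly: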

   0x00007f0a37aac401 <+289>:   mov    0x20cb68(%rip),%rax        # 0x7f0a37cb8f70
   0x00007f0a37aac408 <+296>:   mov    (%rax),%rsi
   0x00007f0a37aac40b <+299>:   xor    %eax,%eax
=> 0x00007f0a37aac40d <+301>:   mov    0x10(%rsi),%rdi

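Looking at these four instructions: the first pulls a value from memory into %rax (a RIP-relative load from 0x7f0a37cb8f70, which falls inside the libncursesw mappings listed earlier rather than on the stack), the second dereferences %rax into %rsi, the third sets %eax to zero (the xor is an optimization, used instead of moving 0), and the last dereferences %rsi plus an offset, although we know %rsi is zero. This sequence is for walking data structures. Maybe %rax would be interesting, but it has been zeroed by the earlier instruction, so we cannot see its value in the core dump register state.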
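As a way of reading that sequence, here is a small Python model of the four instructions running against a toy memory. It is a sketch under stated assumptions, not ncurses code: SegFault, load and run_sequence are invented names, GLOBAL_PTR is a hypothetical address, and the value stored there is assumed to be 0 because %rsi was 0 in the core dump.

```python
class SegFault(Exception):
    """Raised when the toy memory has nothing mapped at an address."""


def load(memory, addr):
    # Model a 64-bit load: reading an unmapped address faults.
    if addr not in memory:
        raise SegFault(hex(addr))
    return memory[addr]


def run_sequence(memory, next_rip):
    # mov 0x20cb68(%rip),%rax : RIP-relative load; %rip holds the address
    # of the following instruction (doupdate+296).
    rax = load(memory, next_rip + 0x20cb68)
    # mov (%rax),%rsi : dereference the pointer that was just loaded.
    rsi = load(memory, rax)
    # xor %eax,%eax : zero %rax, which is why the core dump shows rax = 0.
    rax = 0
    # mov 0x10(%rsi),%rdi : read the field at offset 0x10 from %rsi.
    rdi = load(memory, rsi + 0x10)
    return rax, rsi, rdi


NEXT_RIP = 0x7f0a37aac408        # doupdate+296, from the disassembly
SLOT = NEXT_RIP + 0x20cb68       # the address gdb shows in its "#" comment
GLOBAL_PTR = 0x7f0a38400000      # hypothetical address of a global pointer

memory = {
    SLOT: GLOBAL_PTR,            # the slot holds the address of the global
    GLOBAL_PTR: 0x0,             # assumed NULL, since %rsi was 0 at the crash
}

if __name__ == "__main__":
    print(hex(SLOT))
    try:
        run_sequence(memory, NEXT_RIP)
    except SegFault as fault:
        print("fault reading", fault)
```

The model faults on the last load because it tries to read address 0x10, which mirrors the segfault at doupdate+301.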

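I can set a breakpoint on doupdate+289, then single-step through each instruction to see how the registers are set and how they change. First, I need to launch gdb so that we are executing the program live: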

# gdb `which python`
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python...(no debugging symbols found)...done.

Now to set the breakpoint using b (short for break):

(gdb) b *doupdate + 289
No symbol table is loaded.  Use the "file" command.

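Oops. I wanted to show this error to explain why we often start with a breakpoint on main: by then the symbols are likely loaded, and the real breakpoint of interest can be set. To sidestep the problem, I'll go straight to a breakpoint on doupdate function entry, run the program, and set the offset breakpoint once the first one is hit: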

(gdb) b doupdate
Function "doupdate" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (doupdate) pending.
(gdb) r cachetop.py
Starting program: /usr/bin/python cachetop.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
warning: JITed object file architecture unknown is not compatible with target architecture i386:x86-64.

Breakpoint 1, 0x00007ffff34ad2e0 in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
(gdb) b *doupdate + 289
Breakpoint 2 at 0x7ffff34ad401
(gdb) c
Continuing.


Breakpoint 2, 0x00007ffff34ad401 in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5

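We've arrived at our breakpoint.

In case you haven't done this before: the r (run) command takes arguments that are passed to the gdb target we specified earlier on the command line (python). So this ends up running "python cachetop.py".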
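The address gdb reported for Breakpoint 2 is simply the function's entry address plus the decimal offset. A tiny Python check of that arithmetic, using only addresses printed in the sessions above (the helper name offset_address is made up for this sketch):

```python
def offset_address(entry, offset):
    """Address meant by gdb's `*function + offset` breakpoint syntax."""
    return entry + offset


LIVE_ENTRY = 0x00007ffff34ad2e0   # doupdate entry, from Breakpoint 1 (live run)
CORE_ENTRY = 0x00007f0a37aac2e0   # doupdate <+0>, from the core dump disassembly

if __name__ == "__main__":
    print(hex(offset_address(LIVE_ENTRY, 289)))
    print(hex(offset_address(CORE_ENTRY, 289)))
```

The first result matches the Breakpoint 2 address in the live run and the second matches the <+289> line in the core dump disassembly; the two runs differ only in where the library was mapped.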

9. Stepping

I'll step one instruction (si, short for stepi) and then inspect the registers:

(gdb) si
0x00007ffff34ad408 in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
(gdb) i r
rax            0x7ffff3e8f948   140737285519688
rbx            0xaea060 11444320
rcx            0xae72a0 11432608
rdx            0xa403d0 10748880
rsi            0x7ffff7ea8e10   140737352732176
rdi            0xa403d0 10748880
rbp            0x7ffff3e8fb10   0x7ffff3e8fb10 <SP>
rsp            0x7fffffffd390   0x7fffffffd390
r8             0x7ffff36ba3e0   140737277305824
r9             0x0  0
r10            0x8  8
r11            0x202    514
r12            0x0  0
r13            0x0  0
r14            0x7ffff3e8fb10   140737285520144
r15            0xa403d0 10748880
rip            0x7ffff34ad408   0x7ffff34ad408 <doupdate+296>
eflags         0x202    [ IF ]
cs             0x33 51
ss             0x2b 43
ds             0x0  0
es             0x0  0
fs             0x0  0
gs             0x0  0
(gdb) p/a 0x7ffff3e8f948
$1 = 0x7ffff3e8f948 <cur_term>

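Another clue. The NULL pointer we are dereferencing appears to live in a symbol called "cur_term" (p/a is short for print/a, where "/a" means format as an address). Given that this is ncurses, is our TERM environment variable set to something odd?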

# echo $TERM
xterm-256color

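I tried setting it to vt100 and running the program, but it hit the same segfault.

Note that I inspected the registers only on the first invocation of doupdate(), but it could be called multiple times, and the issue may be in a later invocation. I can step through each one by running c (short for continue). That is fine if it is called only a few times, but if it is called a few thousand times I'll want a different approach. (I'll get back to this in section 15.)

10. Reverse Stepping

One of gdb's amazing features is reverse stepping, which Greg Law included in his talk. Here's an example.

I'll start a python session again, to show this from the beginning: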

# gdb `which python`
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python...(no debugging symbols found)...done.

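Now I'll set a breakpoint on doupdate as before, but once it is hit, I'll enable recording, then continue the program and let it crash. Recording adds considerable overhead, so I don't want to turn it on as early as main.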

(gdb) b doupdate
Function "doupdate" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (doupdate) pending.
(gdb) r cachetop.py
Starting program: /usr/bin/python cachetop.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
warning: JITed object file architecture unknown is not compatible with target architecture i386:x86-64.

Breakpoint 1, 0x00007ffff34ad2e0 in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
(gdb) record
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff34ad40d in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5

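At this point I can reverse-step through lines or instructions. It works by playing back the register state from our recording. I'll step back two instructions, then print the registers: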

(gdb) reverse-stepi
0x00007ffff34ad40d in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
(gdb) reverse-stepi
0x00007ffff34ad40b in doupdate () from /lib/x86_64-linux-gnu/libncursesw.so.5
(gdb) i r
rax            0x7ffff3e8f948   140737285519688
rbx            0xaea060 11444320
rcx            0xae72a0 11432608
rdx            0xa403d0 10748880
rsi            0x0  0
rdi            0xa403d0 10748880
rbp            0x7ffff3e8fb10   0x7ffff3e8fb10 <SP>
rsp            0x7fffffffd390   0x7fffffffd390
r8             0x7ffff36ba3e0   140737277305824
r9             0x0  0
r10            0x8  8
r11            0x302    770
r12            0x0  0
r13            0x0  0
r14            0x7ffff3e8fb10   140737285520144
r15            0xa403d0 10748880
rip            0x7ffff34ad40b   0x7ffff34ad40b <doupdate+299>
eflags         0x202    [ IF ]
cs             0x33 51
ss             0x2b 43
ds             0x0  0
es             0x0  0
fs             0x0  0
gs             0x0  0
(gdb) p/a 0x7ffff3e8f948
$1 = 0x7ffff3e8f948 <cur_term>

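So we are back at the "cur_term" clue. I really want to see the source code at this point, but I'll start with debug info.

11. Debug Info

This is libncursesw, and I don't have debug info installed (Ubuntu):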

# apt-cache search libncursesw
libncursesw5 - shared libraries for terminal handling (wide character support)
libncursesw5-dbg - debugging/profiling libraries for ncursesw
libncursesw5-dev - developer's libraries for ncursesw
# dpkg -l | grep libncursesw


ii  libncursesw5:amd64                  6.0+20160213-1ubuntu1                    amd64        shared libraries for terminal handling (wide character support)

I'll install it:

# apt-get install -y libncursesw5-dbg
Reading package lists... Done
Building dependency tree       
Reading state information... Done
[...]
After this operation, 2,488 kB of additional disk space will be used.
Get:1 http://us-west-1.ec2.archive.ubuntu.com/ubuntu xenial/main amd64 libncursesw5-dbg amd64 6.0+20160213-1ubuntu1 [729 kB]
Fetched 729 kB in 0s (865 kB/s)          
Selecting previously unselected package libncursesw5-dbg.
(Reading database ... 200094 files and directories currently installed.)
Preparing to unpack .../libncursesw5-dbg_6.0+20160213-1ubuntu1_amd64.deb ...
Unpacking libncursesw5-dbg (6.0+20160213-1ubuntu1) ...
Setting up libncursesw5-dbg (6.0+20160213-1ubuntu1) ...
# dpkg -l | grep libncursesw
ii  libncursesw5:amd64                  6.0+20160213-1ubuntu1                    amd64        shared libraries for terminal handling (wide character support)
ii  libncursesw5-dbg                    6.0+20160213-1ubuntu1                    amd64        debugging/profiling libraries for ncursesw

Good, the versions match. So how does our segfault look now?

# gdb `which python` /var/cores/core.python.30520
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
[...]
warning: JITed object file architecture unknown is not compatible with target architecture i386:x86-64.
Core was generated by `python ./cachetop.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  ClrBlank (win=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
1129        if (back_color_erase)
(gdb) bt
#0  ClrBlank (win=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
#1  ClrUpdate () at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1147
#2  doupdate () at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1010
#3  0x00007f0a37aa07e6 in wrefresh (win=win@entry=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_refresh.c:65
#4  0x00007f0a37a99499 in recur_wrefresh (win=win@entry=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:384
#5  0x00007f0a37a99616 in _nc_wgetch (win=win@entry=0x1993060, result=result@entry=0x7ffd33d93e24, use_meta=1)
    at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:491
#6  0x00007f0a37a9a325 in wgetch (win=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:672
#7  0x00007f0a37cc6ec3 in ?? () from /usr/lib/python2.7/lib-dynload/_curses.x86_64-linux-gnu.so
#8  0x00000000004c4d5a in PyEval_EvalFrameEx ()
#9  0x00000000004c2e05 in PyEval_EvalCodeEx ()
#10 0x00000000004def08 in ?? ()
#11 0x00000000004b1153 in PyObject_Call ()
#12 0x00000000004c73ec in PyEval_EvalFrameEx ()
#13 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#14 0x00000000004caf42 in PyEval_EvalFrameEx ()
#15 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#16 0x00000000004c2ba9 in PyEval_EvalCode ()
#17 0x00000000004f20ef in ?? ()
#18 0x00000000004eca72 in PyRun_FileExFlags ()
#19 0x00000000004eb1f1 in PyRun_SimpleFileExFlags ()
#20 0x000000000049e18a in Py_Main ()
#21 0x00007f0a3be10830 in __libc_start_main (main=0x49daf0 <main>, argc=2, argv=0x7ffd33d94838, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7ffd33d94828) at ../csu/libc-start.c:291
#22 0x000000000049da19 in _start ()

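The stack trace looks a little different now: we aren't really in doupdate() but in ClrBlank(), which has been inlined into ClrUpdate(), which in turn has been inlined into doupdate().

Now I really want to see the source.

12. Source Code

With the debug info installed, gdb can list the source code alongside the assembly: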

(gdb) disas/s
Dump of assembler code for function doupdate:
/build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:
759 {
   0x00007f0a37aac2e0 <+0>:   push   %r15
   0x00007f0a37aac2e2 <+2>:   push   %r14
   0x00007f0a37aac2e4 <+4>:   push   %r13
   0x00007f0a37aac2e6 <+6>:   push   %r12
[...]
   0x00007f0a37aac3dd <+253>: jne    0x7f0a37aac6ca <doupdate+1002>

1009        if (CurScreen(SP_PARM)->_clear || NewScreen(SP_PARM)->_clear) {   /* force refresh ? */
   0x00007f0a37aac3e3 <+259>: mov    0x80(%rdx),%rax
   0x00007f0a37aac3ea <+266>: mov    0x88(%rdx),%rcx
   0x00007f0a37aac3f1 <+273>: cmpb   $0x0,0x21(%rax)
   0x00007f0a37aac3f5 <+277>: jne    0x7f0a37aac401 <doupdate+289>
   0x00007f0a37aac3f7 <+279>: cmpb   $0x0,0x21(%rcx)
   0x00007f0a37aac3fb <+283>: je     0x7f0a37aacc3b <doupdate+2395>

1129        if (back_color_erase)
   0x00007f0a37aac401 <+289>: mov    0x20cb68(%rip),%rax        # 0x7f0a37cb8f70
   0x00007f0a37aac408 <+296>: mov    (%rax),%rsi

1128        NCURSES_CH_T blank = blankchar;
   0x00007f0a37aac40b <+299>: xor    %eax,%eax

1129        if (back_color_erase)
=> 0x00007f0a37aac40d <+301>: mov    0x10(%rsi),%rdi
   0x00007f0a37aac411 <+305>: cmpb   $0x0,0x1c(%rdi)
   0x00007f0a37aac415 <+309>: jne    0x7f0a37aac6f7 <doupdate+1047>

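Great! Look at the "=>" arrow and the line of source above it. So we are segfaulting on "if (back_color_erase)"? That doesn't seem possible: a segfault comes from a bad memory access, and this looks like a read of an ordinary variable rather than a pointer dereference.

At this point I double-checked that I had the right debug info version, and re-ran the program in a live gdb session until it segfaulted. Same place.

Is there something special about back_color_erase? We're in ClrBlank(), so I'll list its source: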

(gdb) list ClrBlank
1124    
1125    static NCURSES_INLINE NCURSES_CH_T
1126    ClrBlank(NCURSES_SP_DCLx WINDOW *win)
1127    {
1128        NCURSES_CH_T blank = blankchar;
1129        if (back_color_erase)
1130        AddAttr(blank, (AttrOf(BCE_BKGD(SP_PARM, win)) & BCE_ATTRS));
1131        return blank;
1132    }
1133

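Ah, it isn't defined in this function. Is it a global?

13. TUI

It's worth seeing how this code looks in gdb's text user interface (TUI). I haven't used it much, but I was inspired to after watching Greg's talk.

You can launch it with --tui: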

# gdb --tui `which python` /var/cores/core.python.30520
   ┌───────────────────────────────────────────────────────────────────────────┐
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │             [ No Source Available ]                                       │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   │                                                                           │
   └───────────────────────────────────────────────────────────────────────────┘
None No process In:                                                L??   PC: ?? 
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
---Type <return> to continue, or q <return> to quit---

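It's complaining that there is no Python source. I could fix that, but we're crashing in libncursesw, so I'll ignore it. Hitting enter lets it finish loading, at which point it shows the source from the libncursesw debug info at the point of the crash: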

   ┌──/build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c──────┐
   │1124                                                                       │
   │1125    static NCURSES_INLINE NCURSES_CH_T                                 │
   │1126    ClrBlank(NCURSES_SP_DCLx WINDOW *win)                              │
   │1127    {                                                                  │
   │1128        NCURSES_CH_T blank = blankchar;                                │
  >│1129        if (back_color_erase)                                          │
   │1130            AddAttr(blank, (AttrOf(BCE_BKGD(SP_PARM, win)) & BCE_ATTRS)│
   │1131        return blank;                                                  │
   │1132    }                                                                  │
   │1133                                                                       │
   │1134    /*                                                                 │
   │1135    **      ClrUpdate()                                                │
   │1136    **                                                                 │
   └───────────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7f0a3c5e87 In: doupdate            L1129 PC: 0x7f0a37aac40d 
warning: JITed object file architecture unknown is not compatible with target ar
chitecture i386:x86-64.
---Type <return> to continue, or q <return> to quit---
Core was generated by `python ./cachetop.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  ClrBlank (win=0x1993060)
    at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
(gdb)

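Awesome!

The ">" arrow points to the line of code we crashed on. It gets better: with the layout split command, we can view the source and the assembly in separate windows: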

   ┌──/build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c──────┐
  >│1129        if (back_color_erase)                                          │
   │1130            AddAttr(blank, (AttrOf(BCE_BKGD(SP_PARM, win)) & BCE_ATTRS)│
   │1131        return blank;                                                  │
   │1132    }                                                                  │
   │1133                                                                       │
   │1134    /*                                                                 │
   │1135    **      ClrUpdate()                                                │
   └───────────────────────────────────────────────────────────────────────────┘
  >│0x7f0a37aac40d <doupdate+301>   mov    0x10(%rsi),%rdi                     │
   │0x7f0a37aac411 <doupdate+305>   cmpb   $0x0,0x1c(%rdi)                     │
   │0x7f0a37aac415 <doupdate+309>   jne    0x7f0a37aac6f7 <doupdate+1047>      │
   │0x7f0a37aac41b <doupdate+315>   movswl 0x4(%rcx),%ecx                      │
   │0x7f0a37aac41f <doupdate+319>   movswl 0x74(%rdx),%edi                     │
   │0x7f0a37aac423 <doupdate+323>   mov    %rax,0x40(%rsp)                     │
   │0x7f0a37aac428 <doupdate+328>   movl   $0x20,0x48(%rsp)                    │
   │0x7f0a37aac430 <doupdate+336>   movl   $0x0,0x4c(%rsp)                     │
   └───────────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7f0a3c5e87 In: doupdate            L1129 PC: 0x7f0a37aac40d 
chitecture i386:x86-64.
Core was generated by `python ./cachetop.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
---Type <return> to continue, or q <return> to quit---
#0  ClrBlank (win=0x1993060)
    at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
(gdb) layout split

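Greg demonstrated this with reverse stepping, so you can picture following the source and the assembly at the same time (I would need a video to show that here).

14. External: cscope

I still want to learn more about back_color_erase. I could try gdb's search command, but I've found I'm quicker with an external tool: cscope. cscope is a text-based source code browser that came out of Bell Labs in the 1980s. If you have a modern IDE that you prefer, use that instead.

Setting up cscope: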

# apt-get install -y cscope
# wget http://archive.ubuntu.com/ubuntu/pool/main/n/ncurses/ncurses_6.0+20160213.orig.tar.gz
# tar xvf ncurses_6.0+20160213.orig.tar.gz
# cd ncurses-6.0-20160213
# cscope -bqR
# cscope -dq

cscope -bqR builds the lookup database. cscope -dq then launches cscope.

Searching for the definition of back_color_erase:

Cscope version 15.8b                                   Press the ? key for help












Find this C symbol:
Find this global definition: back_color_erase
Find functions called by this function:
Find functions calling this function:
Find this text string:
Change this text string:
Find this egrep pattern:
Find this file:
Find files #including this file:
Find assignments to this symbol:

Hitting enter:

[...]
#define non_dest_scroll_region         CUR Booleans[26]
#define can_change                     CUR Booleans[27]
#define back_color_erase               CUR Booleans[28]
#define hue_lightness_saturation       CUR Booleans[29]
#define col_addr_glitch                CUR Booleans[30]
#define cr_cancels_micro_mode          CUR Booleans[31]


[...]

Oh, a #define. (They could at least have capitalized it, as is the common style for macros.)

OK, so what is CUR? Looking up definitions with cscope is a breeze:

#define CUR cur_term->type.

At least this macro is capitalized!
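
Chaining the two definitions by hand is easy to get wrong, so here is a minimal sketch of the expansion. It is a plain text substitution, not a real C preprocessor, and the helper name expand is made up for this illustration; only the two macro bodies come from the cscope output.

```python
import re

# Object-like macros copied from the cscope output above.
MACROS = {
    "back_color_erase": "CUR Booleans[28]",
    "CUR": "cur_term->type.",
}


def expand(text, macros, max_passes=10):
    """Repeatedly replace whole-word macro names until nothing changes."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, macros)) + r")\b")
    for _ in range(max_passes):
        expanded = pattern.sub(lambda m: macros[m.group(1)], text)
        if expanded == text:
            break
        text = expanded
    # The real preprocessor works on tokens, so "type." and "Booleans" end up
    # adjacent; drop the space that plain text substitution leaves behind.
    return text.replace(". ", ".")


print(expand("if (back_color_erase)", MACROS))
# prints: if (cur_term->type.Booleans[28])
```

So the innocent-looking "if (back_color_erase)" is really a dereference of the cur_term pointer, which is the kind of expression that can segfault.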

We came across cur_term earlier, while stepping through instructions and examining registers. What is it?

#if 0 && !0
extern NCURSES_EXPORT_VAR(TERMINAL *) cur_term;
#elif 0
NCURSES_WRAPPED_VAR(TERMINAL *, cur_term);
#define cur_term   NCURSES_PUBLIC_VAR(cur_term())
#else
extern NCURSES_EXPORT_VAR(TERMINAL *) cur_term;
#endif

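cscope read /usr/include/term.h for this. So, more macros. The line I think takes effect is the extern declaration in the #else branch, since both of the conditions before it evaluate to 0. Why is there an "if 0 && !0 ... elif 0"? I don't know (I'd need to read more of the source). Programmers sometimes wrap debug code in "#if 0" to disable it in production, but this looks auto-generated.

Searching for NCURSES_EXPORT_VAR finds: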

#  define NCURSES_EXPORT_VAR(type) NCURSES_IMPEXP type

... and NCURSES_IMPEXP:

/* Take care of non-cygwin platforms */
#if !defined(NCURSES_IMPEXP)          
#  define NCURSES_IMPEXP /* nothing */
#endif                                
#if !defined(NCURSES_API)             
#  define NCURSES_API /* nothing */   
#endif                                
#if !defined(NCURSES_EXPORT)          
#  define NCURSES_EXPORT(type) NCURSES_IMPEXP type NCURSES_API
#endif                                
#if !defined(NCURSES_EXPORT_VAR)      
#  define NCURSES_EXPORT_VAR(type) NCURSES_IMPEXP type
#endif

... and TERMINAL:

typedef struct term {       /* describe an actual terminal */
    TERMTYPE    type;       /* terminal type description */
    short   Filedes;    /* file description being written to */
    TTY     Ottyb,      /* original state of the terminal */
        Nttyb;      /* current state of the terminal */
    int     _baudrate;  /* used to compute padding */
    char *      _termname;      /* used for termname() */


} TERMINAL;

Gah! Now TERMINAL is capitalized. Mixed in with the macros, this code isn't easy to follow...
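
To tie the macros back to the faulting instructions, the sketch below uses Python's ctypes to compute where cur_term->type.Booleans[28] would live. The TermType layout is an assumption: the leading members of TERMTYPE are not shown in this walkthrough and are written here from memory of the ncurses 6.0 term.h, so check them against your own header before relying on the numbers.

```python
import ctypes


class TermType(ctypes.Structure):
    # ASSUMPTION: leading members of TERMTYPE as in ncurses 6.0's term.h.
    # They are not shown in the article; verify against your own header.
    _fields_ = [
        ("term_names", ctypes.c_char_p),
        ("str_table", ctypes.c_char_p),
        ("Booleans", ctypes.POINTER(ctypes.c_char)),
        ("Numbers", ctypes.POINTER(ctypes.c_short)),
        ("Strings", ctypes.POINTER(ctypes.c_char_p)),
    ]


class Terminal(ctypes.Structure):
    # Only the first member of TERMINAL matters for this calculation;
    # the remaining members (Filedes, Ottyb, ...) are left out.
    _fields_ = [("type", TermType)]


# Offset of the Booleans pointer, measured from the start of *cur_term.
booleans_offset = Terminal.type.offset + TermType.Booleans.offset
# Offset of element 28 inside the array that Booleans points to.
element_offset = 28 * ctypes.sizeof(ctypes.c_char)

print(hex(booleans_offset), hex(element_offset))
```

On a 64-bit platform this prints 0x10 and 0x1c. If the assumed layout holds, those match the 0x10(%rsi) and 0x1c(%rdi) operands in the disassembly: the first load reads cur_term->type.Booleans, and with cur_term equal to zero it touches address 0x10, which is where the "=>" arrow sits.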

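OK, so who actually sets cur_term? Remember that our problem is that it is zero, either because it was never initialized or because it was explicitly set to zero. Browsing the code paths that assign to it may give more clues about which of those happened. Using the first option in cscope: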

Find this C symbol: cur_term
Find this global definition:
Find functions called by this function:
Find functions calling this function:
[...]

Browsing through the entries quickly finds:

NCURSES_EXPORT(TERMINAL *)
NCURSES_SP_NAME(set_curterm) (NCURSES_SP_DCLx TERMINAL * termp)
{
    TERMINAL *oldterm;

    T((T_CALLED("set_curterm(%p)"), (void *) termp));

    _nc_lock_global(curses);
    oldterm = cur_term;
    if (SP_PARM)
    SP_PARM->_term = termp;
#if USE_REENTRANT
    CurTerm = termp;
#else
    cur_term = termp;
#endif

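I added the highlighting. Even the function name is wrapped in a macro. But at least we've found how cur_term gets set: via set_curterm(). Maybe it isn't being called?

15. External: perf-tools/ftrace/uprobes

I'll cover using gdb for this in a moment, but I can't help trying the uprobe tool from my perf-tools collection, which uses Linux ftrace and uprobes. One advantage of tracers is that they don't pause the target process the way gdb does (although that doesn't matter for cachetop.py here). Another is that I can trace a few events or a few thousand just as easily.

I should be able to trace calls to set_curterm() in libncursesw, and even print the first argument: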

# /apps/perf-tools/bin/uprobe 'p:/lib/x86_64-linux-gnu/libncursesw.so.5:set_curterm %di'
ERROR: missing symbol "set_curterm" in /lib/x86_64-linux-gnu/libncursesw.so.5

Well, that didn't work. Where is set_curterm()? There are lots of ways to find it, such as gdb or objdump:

(gdb) info symbol set_curterm
set_curterm in section .text of /lib/x86_64-linux-gnu/libtinfo.so.5
# objdump -tT /lib/x86_64-linux-gnu/libncursesw.so.5 | grep cur_term
0000000000000000      DO *UND*  0000000000000000  NCURSES_TINFO_5.0.19991023 cur_term
# objdump -tT /lib/x86_64-linux-gnu/libtinfo.so.5 | grep cur_term
0000000000228948 g    DO .bss   0000000000000008  NCURSES_TINFO_5.0.19991023 cur_term

gdb does a better job. Also, had I looked at the source more closely, I would have noticed that it was being built for libtinfo.

Trying to trace set_curterm() in libtinfo:

# /apps/perf-tools/bin/uprobe 'p:/lib/x86_64-linux-gnu/libtinfo.so.5:set_curterm %di'
Tracing uprobe set_curterm (p:set_curterm /lib/x86_64-linux-gnu/libtinfo.so.5:0xfa80 %di). Ctrl-C to end.
          python-31617 [007] d... 24236402.719959: set_curterm: (0x7f116fcc2a80) arg1=0x1345d70
          python-31617 [007] d... 24236402.720033: set_curterm: (0x7f116fcc2a80) arg1=0x13a22e0
          python-31617 [007] d... 24236402.723804: set_curterm: (0x7f116fcc2a80) arg1=0x14cdfa0
          python-31617 [007] d... 24236402.723838: set_curterm: (0x7f116fcc2a80) arg1=0x0
^C

That works. So set_curterm() is called, and it is called four times. The last time it was passed zero, which looks like it could be the problem.
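
With four events it is easy to spot the zero by eye, but with thousands you would want to filter. A minimal sketch of that filtering, assuming the output has been saved as text; the regular expression is written against the line format shown above and the helper name null_calls is made up for this example.

```python
import re

# Event lines copied from the uprobe output above.
TRACE = """\
          python-31617 [007] d... 24236402.719959: set_curterm: (0x7f116fcc2a80) arg1=0x1345d70
          python-31617 [007] d... 24236402.720033: set_curterm: (0x7f116fcc2a80) arg1=0x13a22e0
          python-31617 [007] d... 24236402.723804: set_curterm: (0x7f116fcc2a80) arg1=0x14cdfa0
          python-31617 [007] d... 24236402.723838: set_curterm: (0x7f116fcc2a80) arg1=0x0
"""

# comm-pid [cpu] flags timestamp: probe: (address) arg1=0x...
EVENT = re.compile(
    r"^\s*(?P<comm>\S+)-(?P<pid>\d+)\s+\[\d+\]\s+\S+\s+"
    r"(?P<ts>[\d.]+):\s+(?P<probe>\w+):.*\barg1=(?P<arg1>0x[0-9a-f]+)"
)


def null_calls(text):
    """Return (timestamp, pid) for every traced call whose first argument is 0."""
    hits = []
    for line in text.splitlines():
        m = EVENT.match(line)
        if m and int(m.group("arg1"), 16) == 0:
            hits.append((m.group("ts"), int(m.group("pid"))))
    return hits


print(null_calls(TRACE))
# prints: [('24236402.723838', 31617)]
```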

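If you're wondering how I knew that the %di register holds the first argument, it comes from the AMD64/x86_64 ABI (and the assumption that this compiled library is ABI compliant). Here's a reminder: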

# man syscall
[...]
       arch/ABI      arg1  arg2  arg3  arg4  arg5  arg6  arg7  Notes
       ──────────────────────────────────────────────────────────────────
       arm/OABI      a1    a2    a3    a4    v1    v2    v3
       arm/EABI      r0    r1    r2    r3    r4    r5    r6
       arm64         x0    x1    x2    x3    x4    x5    -
       blackfin      R0    R1    R2    R3    R4    R5    -
       i386          ebx   ecx   edx   esi   edi   ebp   -
       ia64          out0  out1  out2  out3  out4  out5  -
       mips/o32      a0    a1    a2    a3    -     -     -     See below
       mips/n32,64   a0    a1    a2    a3    a4    a5    -
       parisc        r26   r25   r24   r23   r22   r21   -
       s390          r2    r3    r4    r5    r6    r7    -
       s390x         r2    r3    r4    r5    r6    r7    -
       sparc/32      o0    o1    o2    o3    o4    o5    -
       sparc/64      o0    o1    o2    o3    o4    o5    -
       x86_64        rdi   rsi   rdx   r10   r8    r9    -
[...]
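
One caveat about reading that table: it describes system calls. For ordinary function calls such as set_curterm(), the System V AMD64 calling convention uses the same registers except for the fourth argument, which goes in rcx rather than r10. The sketch below encodes both rows; the helper names are made up, and uprobe_fetcharg simply follows the %di spelling used in the uprobe command above.

```python
# x86_64 only. The syscall row is from the man page table above; the
# function-call row is the System V AMD64 calling convention.
SYSCALL_ARGS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")
FUNCTION_ARGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")


def arg_register(n, syscall=False):
    """Return the register holding the n-th (1-based) integer or pointer argument."""
    regs = SYSCALL_ARGS if syscall else FUNCTION_ARGS
    if not 1 <= n <= len(regs):
        raise ValueError("only the first six arguments are passed in registers")
    return regs[n - 1]


def uprobe_fetcharg(n):
    """Spell the register as in the uprobe command above: rdi becomes %di."""
    reg = arg_register(n)
    # Numbered registers (r8, r9) keep their name; the legacy ones drop the "r".
    name = reg if reg[1:].isdigit() else reg[1:]
    return "%" + name


for n in range(1, 7):
    print(n, arg_register(n), uprobe_fetcharg(n))
```

Arguments beyond the sixth, and floating-point or large struct arguments, are passed differently and are not covered by this lookup.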

I'd also like to see the stack trace for the arg1=0x0 call, but this ftrace-based tool doesn't support stack traces yet.

16. External: bcc/BPF

Since we're debugging a bcc tool, cachetop.py, it's worth noting that bcc's trace.py has capabilities similar to my older uprobe tool:

# ./trace.py 'p:tinfo:set_curterm "%d", arg1'
TIME     PID    COMM         FUNC             -
01:00:20 31698  python       set_curterm      38018416
01:00:20 31698  python       set_curterm      38396640
01:00:20 31698  python       set_curterm      39624608
01:00:20 31698  python       set_curterm      0

Yes, we're debugging bcc using bcc!
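
Because the trace.py command above used a "%d" format, the pointers are printed in decimal, which makes them harder to compare with gdb or uprobe output. A small sketch of converting them; the helper name as_pointer is made up for this example.

```python
# First-argument values copied from the trace.py output above (printed with "%d").
ARG1_VALUES = [38018416, 38396640, 39624608, 0]


def as_pointer(value):
    """Format an integer the way gdb and uprobe print pointers."""
    return hex(value)


for value in ARG1_VALUES:
    label = "NULL" if value == 0 else "non-NULL pointer"
    print(f"{as_pointer(value):>10}  {label}")
```

These addresses differ from the ones in the uprobe run because this is a different process (PID 31698 rather than 31617), but the pattern is the same: three non-NULL pointers followed by a zero.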

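If you are new to bcc, it's worth checking out. It provides Python and lua interfaces to the new BPF features in the Linux 4.x series. In short, it enables many performance tools and analyses that were previously impossible or prohibitively expensive to run. I've posted instructions for running it on Ubuntu Xenial.

bcc's trace.py should have a switch for printing user-level stack traces, since the kernel has had BPF stack capabilities since Linux 4.6, but at the time of writing we haven't added that switch yet.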

17. More Breakpoints

I really should have started with a gdb breakpoint on set_curterm(), but I hope the detour through ftrace and BPF was interesting.

Back to live running mode:

# gdb `which python`
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
[...]
(gdb) b set_curterm
Function "set_curterm" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (set_curterm) pending.
(gdb) r cachetop.py
Starting program: /usr/bin/python cachetop.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, set_curterm (termp=termp@entry=0xa43150) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80  {
(gdb) c
Continuing.

Breakpoint 1, set_curterm (termp=termp@entry=0xab5870) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80  {
(gdb) c
Continuing.

Breakpoint 1, set_curterm (termp=termp@entry=0xbecb90) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80  {
(gdb) c
Continuing.

Breakpoint 1, set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80  {

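OK, at this breakpoint we can see that set_curterm() was called with termp=0x0, thanks to debuginfo. Without debuginfo, I would have to print the registers at each breakpoint instead.

I'll print the stack trace so that we can see who is setting cur_term to 0.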

(gdb) bt
#0  set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
#1  0x00007ffff5a44e75 in llvm::sys::Process::FileDescriptorHasColors(int) () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#2  0x00007ffff45cabb8 in clang::driver::tools::Clang::ConstructJob(clang::driver::Compilation&, clang::driver::JobAction const&, clang::driver::InputInfo const&, llvm::SmallVector<clang::driver::InputInfo, 4u> const&, llvm::opt::ArgList const&, char const*) const () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#3  0x00007ffff456ffa5 in clang::driver::Driver::BuildJobsForAction(clang::driver::Compilation&, clang::driver::Action const*, clang::driver::ToolChain const*, char const*, bool, bool, char const*, clang::driver::InputInfo&) const () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#4  0x00007ffff4570501 in clang::driver::Driver::BuildJobs(clang::driver::Compilation&) const () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#5  0x00007ffff457224a in clang::driver::Driver::BuildCompilation(llvm::ArrayRef<char const*>) () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#6  0x00007ffff4396cda in ebpf::ClangLoader::parse(std::unique_ptr<llvm::Module, std::default_delete<llvm::Module> >*, std::unique_ptr<std::vector<ebpf::TableDesc, std::allocator<ebpf::TableDesc> >, std::default_delete<std::vector<ebpf::TableDesc, std::allocator<ebpf::TableDesc> > > >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, char const**, int) () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#7  0x00007ffff4344314 in ebpf::BPFModule::load_cfile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, char const**, int) ()
   from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#8  0x00007ffff4349e5e in ebpf::BPFModule::load_string(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const**, int) ()
   from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#9  0x00007ffff43430c8 in bpf_module_create_c_from_string () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#10 0x00007ffff690ae40 in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#11 0x00007ffff690a8ab in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#12 0x00007ffff6b1a68c in _ctypes_callproc () from /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
#13 0x00007ffff6b1ed82 in ?? () from /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
#14 0x00000000004b1153 in PyObject_Call ()
#15 0x00000000004ca5ca in PyEval_EvalFrameEx ()
#16 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#17 0x00000000004def08 in ?? ()
#18 0x00000000004b1153 in PyObject_Call ()
#19 0x00000000004f4c3e in ?? ()
#20 0x00000000004b1153 in PyObject_Call ()
#21 0x00000000004f49b7 in ?? ()
#22 0x00000000004b6e2c in ?? ()
#23 0x00000000004b1153 in PyObject_Call ()
#24 0x00000000004ca5ca in PyEval_EvalFrameEx ()
#25 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#26 0x00000000004def08 in ?? ()
#27 0x00000000004b1153 in PyObject_Call ()
#28 0x00000000004c73ec in PyEval_EvalFrameEx ()
#29 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#30 0x00000000004caf42 in PyEval_EvalFrameEx ()
#31 0x00000000004c2e05 in PyEval_EvalCodeEx ()
#32 0x00000000004c2ba9 in PyEval_EvalCode ()
#33 0x00000000004f20ef in ?? ()
#34 0x00000000004eca72 in PyRun_FileExFlags ()
#35 0x00000000004eb1f1 in PyRun_SimpleFileExFlags ()
#36 0x000000000049e18a in Py_Main ()
#37 0x00007ffff7811830 in __libc_start_main (main=0x49daf0 <main>, argc=2, argv=0x7fffffffdfb8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffffffdfa8) at ../csu/libc-start.c:291
#38 0x000000000049da19 in _start ()
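
A backtrace this deep is easier to read once you pull out the one frame you care about. The sketch below is a rough illustration, assuming the backtrace was saved as text; it parses gdb's frame lines with a regular expression and returns the frame directly above set_curterm(). The helper names are made up, and the sample is a trimmed copy of the output above.

```python
import re

# A trimmed copy of the backtrace above (frames 0, 1, 9, 10 and 14).
BACKTRACE = """\
#0  set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
#1  0x00007ffff5a44e75 in llvm::sys::Process::FileDescriptorHasColors(int) () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#9  0x00007ffff43430c8 in bpf_module_create_c_from_string () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
#10 0x00007ffff690ae40 in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#14 0x00000000004b1153 in PyObject_Call ()
"""

# "#N  [0xADDR in ]FUNCTION (ARGS)[ from LIBRARY]"
FRAME = re.compile(
    r"^#(?P<num>\d+)\s+(?:0x[0-9a-f]+ in )?(?P<func>.+?) \(.*?\)"
    r"(?: from (?P<lib>\S+))?"
)


def parse_frames(text):
    """Return a list of (frame number, function, library or None)."""
    frames = []
    for line in text.splitlines():
        m = FRAME.match(line)
        if m:
            frames.append((int(m.group("num")), m.group("func"), m.group("lib")))
    return frames


def caller_of(text, callee):
    """Return (function, library) of the frame just above the first callee frame."""
    frames = parse_frames(text)
    for i, frame in enumerate(frames[:-1]):
        if frame[1] == callee:
            return frames[i + 1][1:]
    return None


print(caller_of(BACKTRACE, "set_curterm"))
```

For this sample the caller is llvm::sys::Process::FileDescriptorHasColors(int) in libbcc.so.0. Continuation lines that gdb wraps onto a second row (such as the "from ..." lines under frames 7 and 8) are simply skipped by this parser.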

OK, more clues... I think. We're in llvm::sys::Process::FileDescriptorHasColors(). Is the llvm compiler at fault?

18. External: cscope, Take 2

More source browsing with cscope, this time in llvm. Here is the relevant code behind FileDescriptorHasColors():

static bool terminalHasColors(int fd) {
[...]
  // Now extract the structure allocated by setupterm and free its memory
  // through a really silly dance.
  struct term *termp = set_curterm((struct term *)nullptr);
  (void)del_curterm(termp); // Drop any errors here.

Here's what the code looked like in an earlier version:

static bool terminalHasColors() {
  if (const char *term = std::getenv("TERM")) {
    // Most modern terminals support ANSI escape sequences for colors.
    // We could check terminfo, or have a list of known terms that support
    // colors, but that would be overkill.
    // The user can always ask for no colors by setting TERM to dumb, or
    // using a commandline flag.
    return strcmp(term, "dumb") != 0;
  }
  return false;
}

It became a "silly dance" that involves calling set_curterm() with a null pointer.
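For contrast, the older check is simple enough to mirror in a few lines. The following is a minimal Python sketch of that earlier logic only, not llvm code; the function name `terminal_has_colors` and the `environ` parameter are illustrative assumptions, added so the check can be exercised without touching the real environment.

```python
import os


def terminal_has_colors(environ=None):
    """Mirror of the older llvm check: any TERM other than "dumb" means color."""
    # `environ` is a hypothetical parameter for testing; the default is the
    # real process environment.
    env = os.environ if environ is None else environ
    term = env.get("TERM")
    if term is not None:
        # Same test as strcmp(term, "dumb") != 0 in the C++ version.
        return term != "dumb"
    # TERM is unset: report no color support.
    return False


if __name__ == "__main__":
    print(terminal_has_colors())
```

The older logic only reads an environment variable, so it never calls set_curterm() at all.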

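19. Writing Memory

As an experiment, and to explore a possible workaround, I'll modify the memory of the running process to avoid the set_curterm() call with zero.

I'll run gdb, set a breakpoint on set_curterm(), and continue to the invocation with a zero argument: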

# gdb `which python`
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1                                  
[...]
(gdb) b set_curterm
Function "set_curterm" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (set_curterm) pending.
(gdb) r cachetop.py
Starting program: /usr/bin/python cachetop.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, set_curterm (termp=termp@entry=0xa43150) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80      {
(gdb) c
Continuing.

Breakpoint 1, set_curterm (termp=termp@entry=0xab5870) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80      {
(gdb) c
Continuing.

Breakpoint 1, set_curterm (termp=termp@entry=0xbecb90) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80      {
(gdb) c
Continuing.                                                                    

Breakpoint 1, set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80      {

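At this point I'll use the set command to overwrite memory, replacing zero with the previous set_curterm() argument seen above, 0xbecb90, in the hope that it's still valid.

WARNING: Writing memory is not safe! gdb won't ask "are you sure?". If you get it wrong or make a typo, you will corrupt the application. Best case, your application crashes immediately and you realize your mistake. Worst case, your application keeps running with silently corrupted data that is only discovered years later.

In this case, I'm experimenting on a lab machine with no production data, so I'll continue. I'll print the value of the %rdi register as hex (p/x), then set it to the previous address, print it again, and finally print all registers: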

(gdb) p/x $rdi
$1 = 0x0
(gdb) set $rdi=0xbecb90
(gdb) p/x $rdi
$2 = 0xbecb90
(gdb) i r
rax            0x100    256
rbx            0x1  1
rcx            0xe71    3697
rdx            0x0  0
rsi            0x7ffff5dd45d3   140737318307283
rdi            0xbecb90 12503952
rbp            0x100    0x100
rsp            0x7fffffffa5b8   0x7fffffffa5b8
r8             0xbf0050 12517456
r9             0x1999999999999999   1844674407370955161
r10            0xbf0040 12517440
r11            0x7ffff7bb4b78   140737349634936
r12            0xbecb70 12503920
r13            0xbeaea0 12496544
r14            0x7fffffffa9a0   140737488333216
r15            0x7fffffffa8a0   140737488332960
rip            0x7ffff3c76a80   0x7ffff3c76a80 <set_curterm>
eflags         0x246    [ PF ZF IF ]
cs             0x33 51
ss             0x2b 43
ds             0x0  0
es             0x0  0
fs             0x0  0
gs             0x0  0

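(Since I had debuginfo installed at this point, I didn't need to refer to registers: I could have used set on "termp", the name of the argument to set_curterm(), instead of $rdi.)

%rdi is now populated, so those registers look OK to continue.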

(gdb) c
Continuing.

Breakpoint 1, set_curterm (termp=termp@entry=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80  {

OK, we survived a call to set_curterm()! However, we've hit another one, also with an argument of zero. Trying the write trick again:

(gdb) set $rdi=0xbecb90
(gdb) c
Continuing.
warning: JITed object file architecture unknown is not compatible with target architecture i386:x86-64.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff34ad411 in ClrBlank (win=0xaea060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
1129        if (back_color_erase)

Ahh. That's what I get for writing memory. So this experiment ended in another segfault.
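Looking back at the register write in this section: setting $rdi changes termp because, under the System V AMD64 calling convention used on x86-64 Linux, the first integer or pointer argument is passed in %rdi. As a small aside, here is a Python sketch that parses lines in the `i r` format shown above and checks that the hex and decimal columns agree. The helper name `parse_registers` is hypothetical, and only lines with a plain decimal third column are handled.

```python
def parse_registers(dump):
    """Parse `info registers` lines of the form: name  hex  decimal."""
    regs = {}
    for line in dump.splitlines():
        fields = line.split()
        # Skip lines whose last column is not a plain decimal number
        # (for example rbp, rsp, rip and eflags in the dump above).
        if len(fields) != 3 or not fields[2].isdigit():
            continue
        name, hex_value, dec_value = fields
        value = int(hex_value, 16)
        if value != int(dec_value):
            raise ValueError("columns disagree for " + name)
        regs[name] = value
    return regs


# A few lines copied from the register dump in this section.
SAMPLE = """\
rdx            0x0  0
rsi            0x7ffff5dd45d3   140737318307283
rdi            0xbecb90 12503952
rbp            0x100    0x100
r12            0xbecb70 12503920
"""

if __name__ == "__main__":
    regs = parse_registers(SAMPLE)
    # rdi holds the pointer written with `set $rdi=0xbecb90`.
    print(hex(regs["rdi"]))
```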

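20. Conditional Breakpoints

In the previous section, I had to use three continues to reach the right invocation of the breakpoint. If that had been hundreds of invocations, I'd use a conditional breakpoint instead. Here's an example.

I'll run the program and break on set_curterm() as before: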

# gdb `which python`
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1                                  
[...]
(gdb) b set_curterm
Function "set_curterm" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (set_curterm) pending.
(gdb) r cachetop.py
Starting program: /usr/bin/python cachetop.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, set_curterm (termp=termp@entry=0xa43150) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80  {

Now I'll turn breakpoint 1 into a conditional breakpoint, so that it only fires when the %rdi register is zero:

(gdb) cond 1 $rdi==0x0
(gdb) i b
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00007ffff3c76a80 in set_curterm at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
    stop only if $rdi==0x0
    breakpoint already hit 1 time
(gdb) c
Continuing.

Breakpoint 1, set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
(gdb)

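Neat! cond is short for conditional. So why didn't I set the condition right away, when I first created the "pending" breakpoint? I found that conditions don't work on pending breakpoints, at least on this gdb version. (Either that, or I'm doing it wrong.) I also used i b (info breakpoints) here to list the breakpoints along with their details.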
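The same condition can also be expressed with gdb's Python API, where a gdb.Breakpoint subclass decides in its stop() method whether to halt. The sketch below is an illustration under assumptions, not part of the session above: the file name break_on_null.py, the class name NullArgBreakpoint and the helper is_null_arg are hypothetical, and it assumes the script is sourced after libncurses has been loaded (for example at the first set_curterm stop). Its behavior with pending breakpoints is not covered here.

```python
# Save as break_on_null.py and load it inside gdb with: source break_on_null.py
try:
    import gdb  # only importable inside gdb's embedded Python
except ImportError:
    gdb = None


def is_null_arg(rdi_value):
    """Return True when the first-argument register holds a null pointer."""
    return int(rdi_value) == 0


if gdb is not None:
    class NullArgBreakpoint(gdb.Breakpoint):
        """Breakpoint that only stops when $rdi is zero."""

        def stop(self):
            # Returning False tells gdb to keep running without stopping.
            return is_null_arg(gdb.parse_and_eval("$rdi"))

    NullArgBreakpoint("set_curterm")
```

Keeping the test in a plain function means the decision logic can be checked outside gdb, while the breakpoint itself is only created when the script runs inside gdb.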

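21. Returns

I did try another write-like hack, this time changing the instruction path rather than the data.

WARNING: see the previous warning, which also applies here.

I'll take us to the set_curterm() 0x0 breakpoint as before, and then issue a ret (short for return), which returns from the function immediately without executing it. My hope is that by not executing it, the global curterm won't get set to 0x0.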

[...]
(gdb) c
Continuing.

Breakpoint 1, set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80

(gdb) ret
Make set_curterm return now? (y or n) y
#0  0x00007ffff5a44e75 in llvm::sys::Process::FileDescriptorHasColors(int) () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
_nc_free_termtype (ptr=ptr@entry=0x100) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/free_ttype.c:52
52      FreeIfNeeded(ptr->str_table);

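Another crash. Again, that's what I get for messing around like this.

One more try. After browsing the code a bit more, I want to try doing a ret twice, in case the parent function is also involved. Again, this is just a hacky experiment: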

[...]
(gdb) c
Continuing.

Breakpoint 1, set_curterm (termp=0x0) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tinfo/lib_cur_term.c:80
80  {
(gdb) ret
Make set_curterm return now? (y or n) y
#0  0x00007ffff5a44e75 in llvm::sys::Process::FileDescriptorHasColors(int) () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
(gdb) ret
Make selected stack frame return now? (y or n) y
#0  0x00007ffff45cabb8 in clang::driver::tools::Clang::ConstructJob(clang::driver::Compilation&, clang::driver::JobAction const&, clang::driver::InputInfo const&, llvm::SmallVector<clang::driver::InputInfo, 4u> const&, llvm::opt::ArgList const&, char const*) const () from /usr/lib/x86_64-linux-gnu/libbcc.so.0
(gdb) c

The screen goes blank and pauses... then redraws:

07:44:22 Buffers MB: 61 / Cached MB: 1246
PID      UID      CMD              HITS     MISSES   DIRTIES  READ_HIT%  WRITE_HIT%
    2742 root     systemd-logind          3       66        2       1.4%      95.7%
   15836 root     kworker/u30:1           7        0        1      85.7%       0.0%
    2736 messageb dbus-daemon             8       66        2       8.1%      89.2%
       1 root     systemd                15        0        0     100.0%       0.0%
    2812 syslog   rs:main Q:Reg          16       66        8       9.8%      80.5%
     435 root     systemd-journal        32       66        8      24.5%      67.3%
    2740 root     accounts-daemon       113       66        2      62.0%      36.9%
   15847 root     bash                  160        0        1      99.4%       0.0%
   15864 root     lesspipe              306        0        2      99.3%       0.0%
   15854 root     bash                  309        0        2      99.4%       0.0%
   15856 root     bash                  309        0        2      99.4%       0.0%
   15866 root     bash                  309        0        2      99.4%       0.0%
   15867 root     bash                  309        0        2      99.4%       0.0%
   15860 root     bash                  313        0        2      99.4%       0.0%
   15868 root     bash                  341        0        2      99.4%       0.0%
   15858 root     uname                 452        0        2      99.6%       0.0%
   15858 root     bash                  453        0        2      99.6%       0.0%
   15866 root     dircolors             464        0        2      99.6%       0.0%
   15861 root     basename              465        0        2      99.6%       0.0%
   15864 root     dirname               468        0        2      99.6%       0.0%
   15856 root     ls                    476        0        2      99.6%       0.0%
[...]

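Wow! It worked!

22. A Better Workaround

I'd been posting the debugging output to github, especially since the lead BPF engineer, Alexei Starovoitov, is also well versed in llvm internals, and the root cause seemed to be a bug in llvm. While I was messing with memory writes and returns, he suggested adding the llvm option -fno-color-diagnostics to bcc, to avoid this problem code path. It worked! It was added to bcc as a workaround. (I'd still like to see that llvm bug fixed.)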
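For anyone hitting the same crash before a fixed bcc is installed, a similar effect might be achievable from the tool itself, since bcc's Python BPF class accepts extra compiler flags through its cflags argument. This is only a sketch of that idea, not the change that went into bcc: the helper name with_no_color is hypothetical, and whether the flag avoids the problem code path on a given bcc version is an assumption.

```python
NO_COLOR_FLAG = "-fno-color-diagnostics"


def with_no_color(cflags=None):
    """Return a copy of cflags with -fno-color-diagnostics appended once."""
    flags = list(cflags or [])
    if NO_COLOR_FLAG not in flags:
        flags.append(NO_COLOR_FLAG)
    return flags


if __name__ == "__main__":
    # Hypothetical use inside a bcc tool (requires bcc and root):
    #   from bcc import BPF
    #   b = BPF(text=bpf_text, cflags=with_no_color())
    print(with_no_color(["-DDEBUG"]))
```

The helper copies its input rather than mutating it, so a tool's existing flag list stays untouched.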

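23. Python Context

At this point the problem is fixed, but you may be curious to see the stack trace with the Python frames fully resolved.

Adding python-dbg: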

# apt-get install -y python-dbg
Reading package lists... Done
[...]
The following additional packages will be installed:
  libpython-dbg libpython2.7-dbg python2.7-dbg
Suggested packages:
  python2.7-gdbm-dbg python2.7-tk-dbg python-gdbm-dbg python-tk-dbg
The following NEW packages will be installed:
  libpython-dbg libpython2.7-dbg python-dbg python2.7-dbg
0 upgraded, 4 newly installed, 0 to remove and 20 not upgraded.
Need to get 11.9 MB of archives.
After this operation, 36.4 MB of additional disk space will be used.
[...]

Now I'll rerun gdb and view the stack trace:

# gdb `which python` /var/cores/core.python.30520
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
[...]
Reading symbols from /usr/bin/python...Reading symbols from /usr/lib/debug/.build-id/4e/a0539215b2a9e32602f81c90240874132c1a54.debug...done.
[...]
(gdb) bt
#0  ClrBlank (win=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1129
#1  ClrUpdate () at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1147
#2  doupdate () at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/tty/tty_update.c:1010
#3  0x00007f0a37aa07e6 in wrefresh (win=win@entry=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_refresh.c:65
#4  0x00007f0a37a99499 in recur_wrefresh (win=win@entry=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:384
#5  0x00007f0a37a99616 in _nc_wgetch (win=win@entry=0x1993060, result=result@entry=0x7ffd33d93e24, use_meta=1)
    at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:491
#6  0x00007f0a37a9a325 in wgetch (win=0x1993060) at /build/ncurses-pKZ1BN/ncurses-6.0+20160213/ncurses/base/lib_getch.c:672
#7  0x00007f0a37cc6ec3 in PyCursesWindow_GetCh.lto_priv.109 (self=0x7f0a3c57b198, args=()) at /build/python2.7-HpIZBG/python2.7-2.7.11/Modules/_cursesmodule.c:853
#8  0x00000000004c4d5a in call_function (oparg=<optimized out>, pp_stack=0x7ffd33d93f30) at ../Python/ceval.c:4350
#9  PyEval_EvalFrameEx () at ../Python/ceval.c:2987
#10 0x00000000004c2e05 in PyEval_EvalCodeEx () at ../Python/ceval.c:3582
#11 0x00000000004def08 in function_call.lto_priv () at ../Objects/funcobject.c:523
#12 0x00000000004b1153 in PyObject_Call () at ../Objects/abstract.c:2546
#13 0x00000000004c73ec in ext_do_call (nk=0, na=<optimized out>, flags=<optimized out>, pp_stack=0x7ffd33d941e8, func=<function at remote 0x7f0a37edcc80>)
    at ../Python/ceval.c:4662
#14 PyEval_EvalFrameEx () at ../Python/ceval.c:3026
#15 0x00000000004c2e05 in PyEval_EvalCodeEx () at ../Python/ceval.c:3582
#16 0x00000000004caf42 in fast_function (nk=0, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffd33d943f0, func=<function at remote 0x7f0a38039140>)
    at ../Python/ceval.c:4445
#17 call_function (oparg=<optimized out>, pp_stack=0x7ffd33d943f0) at ../Python/ceval.c:4370
#18 PyEval_EvalFrameEx () at ../Python/ceval.c:2987
#19 0x00000000004c2e05 in PyEval_EvalCodeEx () at ../Python/ceval.c:3582
#20 0x00000000004c2ba9 in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at ../Python/ceval.c:669
#21 0x00000000004f20ef in run_mod.lto_priv () at ../Python/pythonrun.c:1376
#22 0x00000000004eca72 in PyRun_FileExFlags () at ../Python/pythonrun.c:1362
#23 0x00000000004eb1f1 in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:948
#24 0x000000000049e18a in Py_Main () at ../Modules/main.c:640
#25 0x00007f0a3be10830 in __libc_start_main (main=0x49daf0 <main>, argc=2, argv=0x7ffd33d94838, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7ffd33d94828) at ../csu/libc-start.c:291
#26 0x000000000049da19 in _start ()
