Preface
I already touched on some IPython usage in my slides on advanced Python programming. Today we continue digging into IPython, from the basics on up; this post assumes that you, the reader, already know IPython and have been using it for a while.
%run
This magic command executes the code in a script and stores the resulting variables in IPython's interactive namespace:
```
$ cat t.py
# coding=utf-8
l = range(5)

$ ipython
In [1]: %run t.py  # the `%` is optional here

In [2]: l  # `l` was defined inside t.py, but it is now usable directly
Out[2]: [0, 1, 2, 3, 4]
```
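To get a feel for what `%run` does (execute a file, then pull its globals into your namespace), the standard library's `runpy` behaves similarly; a minimal sketch, not from the original post:

```python
import os
import runpy
import tempfile

# write a tiny script, playing the role of t.py above
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write("l = list(range(5))\n")
    path = f.name

# run_path executes the file and returns its global namespace as a dict;
# %run does this and then merges the result into the interactive namespace
namespace = runpy.run_path(path)
os.unlink(path)

print(namespace['l'])  # [0, 1, 2, 3, 4]
```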
alias
```
In [3]: %alias largest ls -1sSh | grep %s

In [4]: largest to
total 42M
 20K tokenize.py
 16K tokenize.pyc
8.0K story.html
4.0K autopep8
4.0K autopep8.bak
4.0K story_layout.html
```
PS: aliases need to be stored, otherwise they are gone after you restart IPython:
```
In [5]: %store largest
Alias stored: largest (ls -1sSh | grep %s)
```
The next time you start IPython, restore it with `%store -r`.
bookmark – aliases for directories
```
In [2]: %pwd
Out[2]: u'/home/vagrant'

In [3]: %bookmark dongxi ~/shire/dongxi_code

In [4]: %cd dongxi
/home/vagrant/shire/dongxi_code

In [5]: %pwd
Out[5]: u'/home/vagrant/shire/dongxi_code'
```
ipcluster – parallel computing
IPython actually ships with convenient parallel-computing support. Let's see what it offers through a word-count example; first download a sample text:

```
$ wget http://www.gutenberg.org/files/27287/27287-0.txt
```
The first version is the direct, serial approach everyone is used to.
```
In [1]: import re

In [2]: import io

In [3]: non_word = re.compile(r'[\W\d]+', re.UNICODE)

In [4]: common_words = {
   ...: 'the','of','and','in','to','a','is','it','that','which','as','on','by',
   ...: 'be','this','with','are','from','will','at','you','not','for','no','have',
   ...: 'i','or','if','his','its','they','but','their','one','all','he','when',
   ...: 'than','so','these','them','may','see','other','was','has','an','there',
   ...: 'more','we','footnote', 'who', 'had', 'been', 'she', 'do', 'what',
   ...: 'her', 'him', 'my', 'me', 'would', 'could', 'said', 'am', 'were', 'very',
   ...: 'your', 'did', 'not',
   ...: }

In [5]: def yield_words(filename):
   ...:     import io
   ...:     with io.open(filename, encoding='latin-1') as f:
   ...:         for line in f:
   ...:             for word in line.split():
   ...:                 word = non_word.sub('', word.lower())
   ...:                 if word and word not in common_words:
   ...:                     yield word
   ...:

In [6]: def word_count(filename):
   ...:     word_iterator = yield_words(filename)
   ...:     counts = defaultdict(int)
   ...:     while True:
   ...:         try:
   ...:             word = next(word_iterator)
   ...:         except StopIteration:
   ...:             break
   ...:         else:
   ...:             counts[word] += 1
   ...:     return counts
   ...:

In [6]: from collections import defaultdict  # silly me, forgot to include this earlier...

In [7]: %time counts = word_count(filename)  # filename = '27287-0.txt', downloaded above
CPU times: user 88.5 ms, sys: 2.48 ms, total: 91 ms
Wall time: 89.3 ms
```
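As an aside, on Python 3 the same serial count can be written more compactly with `collections.Counter`; a sketch (the regex mirrors the one above, the stop-word set is trimmed for brevity):

```python
import re
from collections import Counter

non_word = re.compile(r'[\W\d]+', re.UNICODE)
common_words = {'the', 'of', 'and', 'a', 'to'}  # trimmed stop-word set

def count_words(lines):
    """Count cleaned, non-stop-word tokens in an iterable of lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            word = non_word.sub('', token.lower())
            if word and word not in common_words:
                counts[word] += 1
    return counts

counts = count_words(["The hobbits of the Shire", "a ring to rule"])
# counts: {'hobbits': 1, 'shire': 1, 'ring': 1, 'rule': 1}
```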
Now let's run it with IPython's parallel machinery:
```
$ ipcluster start -n 2  # well, my Mac has two cores
```
First, a walkthrough of IPython's parallel API:
```
In [1]: from IPython.parallel import Client  # the %px* magics only work after this import

In [2]: rc = Client()

In [3]: rc.ids  # two ids, because I started two engines
Out[3]: [0, 1]

In [4]: %autopx  # without autopx, every statement needs a `%px xxx` prefix
%autopx enabled

In [5]: import os  # without autopx this would be `%px import os`

In [6]: print os.getpid()  # the pids of the two engine processes
[stdout:0] 62638
[stdout:1] 62636

In [7]: %pxconfig --targets 1  # this magic is unavailable while autopx is on
[stderr:0] ERROR: Line magic function `%pxconfig` not found.
[stderr:1] ERROR: Line magic function `%pxconfig` not found.

In [8]: %autopx  # running it again turns autopx off
%autopx disabled

In [10]: %pxconfig --targets 1  # target engine 1, so the code below runs only on the second engine

In [11]: %%px --noblock  # run a block of code without blocking
   ....: import time
   ....: time.sleep(1)
   ....: os.getpid()
   ....:
Out[11]: <AsyncResult: execute>

In [12]: %pxresult  # see -- only the second engine's pid came back
Out[1:21]: 62636

In [13]: v = rc[:]  # use all engines; IPython gives fine-grained control over what each engine runs

In [14]: with v.sync_imports():  # import the time module on every engine
   ....:     import time
   ....:
importing time on engine(s)

In [15]: def f(x):
   ....:     time.sleep(1)
   ....:     return x * x
   ....:

In [16]: v.map_sync(f, range(10))  # run synchronously
Out[16]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [17]: r = v.map(f, range(10))  # run asynchronously

In [18]: r.ready(), r.elapsed  # just like celery
Out[18]: (True, 5.87735)

In [19]: r.get()  # fetch the results
Out[19]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```
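If you only want the `map_sync`/`map` split without spinning up a cluster, the standard library's `concurrent.futures` offers the same blocking/non-blocking pattern; a rough analogue (threads here instead of engine processes, purely for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    return x * x

with ThreadPoolExecutor(max_workers=2) as pool:
    # consuming map() blocks for all results, like view.map_sync()
    results = list(pool.map(f, range(10)))
    # submit() returns a future immediately, like view.map()'s AsyncResult
    future = pool.submit(f, 7)
    answer = future.result()  # .result() blocks, like r.get()
```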
Now, back to the word count:
```
In [20]: def split_text(filename):
   ....:     text = open(filename).read()
   ....:     lines = text.splitlines()
   ....:     nlines = len(lines)
   ....:     n = 10
   ....:     block = nlines//n
   ....:     for i in range(n):
   ....:         chunk = lines[i*block:(i+1)*block]
   ....:         with open('count_file%i.txt' % i, 'w') as f:
   ....:             f.write('\n'.join(chunk))
   ....:     cwd = os.path.abspath(os.getcwd())
   ....:     fnames = [os.path.join(cwd, 'count_file%i.txt' % i) for i in range(n)]  # not using glob, to be precise
   ....:     return fnames

In [21]: from IPython import parallel

In [22]: rc = parallel.Client()

In [23]: view = rc.load_balanced_view()

In [24]: v = rc[:]

In [25]: v.push(dict(
   ....:     non_word=non_word,
   ....:     yield_words=yield_words,
   ....:     common_words=common_words
   ....: ))
Out[25]: <AsyncResult: _push>

In [26]: fnames = split_text(filename)

In [27]: def count_parallel():
   .....:     pcounts = view.map(word_count, fnames)
   .....:     counts = defaultdict(int)
   .....:     for pcount in pcounts.get():
   .....:         for k, v in pcount.iteritems():
   .....:             counts[k] += v
   .....:     return counts, pcounts
   .....:

In [28]: %time counts, pcounts = count_parallel()  # this timing includes the aggregation step
CPU times: user 47.6 ms, sys: 6.67 ms, total: 54.3 ms  # quite a bit faster than the serial run, no?
Wall time: 106 ms

In [29]: pcounts.elapsed, pcounts.serial_time, pcounts.wall_time
Out[29]: (0.104384, 0.13980499999999998, 0.104384)
```
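The aggregation step inside `count_parallel()`, folding the per-chunk dicts into one total, can be isolated as a small helper; a sketch with made-up data:

```python
from collections import defaultdict

def merge_counts(partials):
    """Fold a sequence of per-chunk word-count dicts into one total."""
    totals = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            totals[word] += n
    return dict(totals)

merged = merge_counts([{'shire': 2, 'ring': 1}, {'ring': 3, 'mordor': 1}])
# merged: {'shire': 2, 'ring': 4, 'mordor': 1}
```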
For more on parallel computing, see: Parallel Computing with IPython