Python Algorithms: The Basics

Published on 2015-05-20

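This section covers three topics: how to express an algorithm's asymptotic running time, six rules of thumb for evaluating algorithm performance, and how to implement trees and graphs in Python.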

1. Computational models

Turing machine model: A Turing machine is a simple (abstract) device that can read from, write to, and move along an infinitely long strip of paper. The actual behavior of the machines varies. Each is a so-called finite state machine: it has a finite set of states (some of which indicate that it has finished), and every symbol it reads potentially triggers reading and/or writing and switching to a different state. You can think of this machinery as a set of rules. (“If I am in state 4 and see an X, I move one step to the left, write a Y, and switch to state 9.”)

RAM model (random-access machine): a standard single-core computer, which has roughly the following three properties:

• We don’t have access to any form of concurrent execution; the machine simply executes one instruction after the other.

In other words, the machine has no concurrency; it simply executes instructions one after another, in order.

• Standard, basic operations (such as arithmetic, comparisons, and memory access) all take constant (although possibly different) amounts of time. There are no more complicated basic operations (such as sorting).

All basic operations take constant time, and there are no more complicated basic operations.

• One computer word (the size of a value that we can work with in constant time) is not unlimited but is big enough to address all the memory locations used to represent our problem, plus an extra percentage for our variables.

The machine's word size is large enough to address every memory location in use.

The essence of an algorithm: An algorithm is a procedure, consisting of a finite set of steps (possibly including loops and conditionals) that solves a given problem in finite time.

The notion of running time complexity (as described in the next section) is based on knowing how big a problem instance is, and that size is simply the amount of memory needed to encode it.

[An algorithm's running time is expressed in terms of the problem size, which is the amount of memory the problem's input occupies.]

2. Asymptotic running time

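This part covers big O, big Ω, and big Θ notation. There is plenty of material on these online and most readers already know them, so they are skipped here; see the Wikipedia article on big O notation.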

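As Introduction to Algorithms puts it, the three symbols can be read roughly as follows: O = ≤, Ω = ≥, Θ = =. That is, O gives an asymptotic upper bound, Ω an asymptotic lower bound, and Θ a tight bound.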
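The informal reading above corresponds to the usual set-based definitions, sketched here for reference. The symbols f, g, c, and n_0 are the conventional ones and do not come from the post itself.

```latex
\begin{align*}
O(g)      &= \{\, f : \exists\, c > 0,\ n_0 > 0 \ \text{such that}\ 0 \le f(n) \le c\, g(n) \ \text{for all}\ n \ge n_0 \,\} \\
\Omega(g) &= \{\, f : \exists\, c > 0,\ n_0 > 0 \ \text{such that}\ 0 \le c\, g(n) \le f(n) \ \text{for all}\ n \ge n_0 \,\} \\
\Theta(g) &= O(g) \cap \Omega(g)
\end{align*}
```

For example, 3n^2 + 2n is in Θ(n^2), because 3n^2 ≤ 3n^2 + 2n ≤ 5n^2 for all n ≥ 1.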

Three special cases of running time: best case, worst case, and average case.

For several common running times and example algorithms, see the Wikipedia article on time complexity.

3. Rules of thumb for evaluating algorithm performance

(1)Tip 1: If possible, don’t worry about it.

If a brute-force solution is good enough, leave it at that and don't worry about it.

(2)Tip 2: For timing things, use timeit.

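Use the timeit module to measure running time. The timeit module was already introduced in the list part of the third section of the earlier data structures post. When using it, make sure that an earlier run does not affect later repetitions (for example, when timing a sorting algorithm, passing the already-sorted sequence from an earlier run to a later repetition invalidates the measurement).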
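A minimal sketch of the pitfall described above, assuming we time list.sort on 1000 random numbers. The names data and sort_copy are hypothetical. Each call sorts a fresh copy, so no repetition sees input that an earlier one already sorted.

```python
import random
import timeit

# Hypothetical input: 1000 random floats in unsorted order.
data = [random.random() for _ in range(1000)]

def sort_copy():
    # Sort a fresh copy so every repetition starts from the same unsorted input.
    working = list(data)
    working.sort()
    return working

# Three measurements of 100 calls each; the minimum is the least noisy one.
timings = timeit.repeat(sort_copy, number=100, repeat=3)
best = min(timings)
print("best of 3: %.6f s per 100 sorts" % best)
```

The copy is included in the measured time. It costs linear time, which is small next to the sort, but worth remembering when comparing very fast operations.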

(3)Tip 3: To find bottlenecks, use a profiler.

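Use the cProfile module to get more detail about where time is spent and so find the bottleneck. If cProfile is not available on your system, use the profile module instead. For more on both, see The Python Profilers in the Python Standard Library documentation.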
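As a rough illustration, the sketch below profiles a deliberately slow function and collects the report as text. The functions slow_membership and work are made up for this example; cProfile.Profile and pstats.Stats are the standard library entry points.

```python
import cProfile
import io
import pstats

def slow_membership(n):
    # Hypothetical bottleneck: each "in" test scans the list, so this is quadratic.
    items = list(range(n))
    return sum(1 for i in range(n) if i in items)

def work():
    return slow_membership(500)

profiler = cProfile.Profile()
profiler.enable()
result = work()
profiler.disable()

# Collect the five most expensive entries (by cumulative time) into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

For a quick look from the command line, python -m cProfile script.py prints a similar table without any code changes.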

(4)Tip 4: Plot your results.

Plot your algorithm's performance results, as shown in the figure below; matplotlib is one module you can use for this.

(5)Tip 5: Be careful when drawing conclusions based on timing comparisons.

Be careful when drawing conclusions from timing comparisons.

First, any differences you observe may be because of random variations.

First, the differences you observe may be caused by random variation, for example in the input.

Second, there are issues when comparing averages.

Second, comparing averages of running times is problematic. [I did not fully understand this point; the author's explanation follows.]

At the very least, you should stick to comparing averages of actual timings. A common practice to get more meaningful numbers when performing timing experiments is to normalize the running time of each program, dividing it by the running time of some standard, simple algorithm. This can indeed be useful but can in some cases make your results less than meaningful. See the paper “How not to lie with statistics: The correct way to summarize benchmark results” by Fleming and Wallace for a few pointers. For some other perspectives, you could read Bast and Weber’s “Don’t compare averages,” or the more recent paper by Citron et al., “The harmonic or geometric mean: does it really matter?”

Third, your conclusions may not generalize.

Finally, your conclusions may not carry over to other settings, so don't make them too broad.

(6)Tip 6: Be careful when drawing conclusions about asymptotics from experiments.

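Be careful when drawing conclusions about asymptotic running time from experiments. Experiments only support the theory: an experiment can refute a hypothesis about asymptotic running time, but in general it cannot prove one. [The author's explanation follows.]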

If you want to say something conclusively about the asymptotic behavior of an algorithm, you need to analyze it, as described earlier in this chapter. Experiments can give you hints, but they are by their nature finite, and asymptotics deal with what happens for arbitrarily large data sizes. On the other hand, unless you’re working in theoretical computer science, the purpose of asymptotic analysis is to say something about the behavior of the algorithm when implemented and run on actual problem instances, meaning that experiments should be relevant.

4. Implementing trees and graphs in Python

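[dict and set in Python]
Python uses hashing in many places; hashing was covered in the search part of the earlier Python data structures post. Python provides a hash function: for example, hash("Hello, world!") returns -943387004357456228 (your result may differ). Both dict and set are built on hashing, so on average looking up an element takes constant time.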

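(1) Representing graphs: the two most common representations are adjacency lists and adjacency matrices. [Assume the graph to be represented is the following.]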

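Adjacency lists: for historical reasons, an adjacency list usually means a linked list, but other structures work too; in Python it can be a list, a set, or a dict. Each choice has its own trade-offs: they differ in how you test whether two nodes are connected and how you compute a node's degree, and even in the performance of those two operations.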

① The adjacency lists representation
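A minimal sketch of this representation. The sample graph is an assumption, since the figure is not reproduced in the text: a directed graph on eight nodes named a to h, modeled on the example in Hetland's book.

```python
# Assumed sample graph: eight nodes numbered 0..7, named a..h for readability.
a, b, c, d, e, f, g, h = range(8)

# N[v] is the list of neighbors (successors) of node v.
N = [
    [b, c, d, e, f],  # a
    [c, e],           # b
    [d],              # c
    [e],              # d
    [f],              # e
    [c, g, h],        # f
    [f, h],           # g
    [f, g],           # h
]

is_neighbor = b in N[a]  # membership test: scans the list, linear in len(N[a])
out_degree_f = len(N[f])  # out-degree of f: constant time
```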

② The adjacency sets representation
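The same assumed eight-node sample graph, with each neighbor list replaced by a set. This is a sketch, not the post's original listing.

```python
# Assumed sample graph: eight nodes numbered 0..7, named a..h for readability.
a, b, c, d, e, f, g, h = range(8)

# N[v] is the set of neighbors (successors) of node v.
N = [
    {b, c, d, e, f},  # a
    {c, e},           # b
    {d},              # c
    {e},              # d
    {f},              # e
    {c, g, h},        # f
    {f, h},           # g
    {f, g},           # h
]

is_neighbor = b in N[a]  # membership test: hash lookup, constant time on average
out_degree_f = len(N[f])  # out-degree of f: constant time
```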

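This is essentially the same as the adjacency lists representation. However, testing whether an element is in a list takes linear time, O(N(v)) where N(v) is the number of neighbors of node v, whereas in a set it takes constant time O(1) on average, so adjacency sets are more efficient for dense graphs.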

③ The adjacency dicts representation
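A sketch of adjacency dicts on the same assumed sample graph. The edge weights are made-up values for illustration only.

```python
# Assumed sample graph: eight nodes numbered 0..7, named a..h for readability.
a, b, c, d, e, f, g, h = range(8)

# N[v] maps each neighbor of v to the (hypothetical) weight of the edge v -> neighbor.
N = [
    {b: 2, c: 1, d: 3, e: 9, f: 4},  # a
    {c: 4, e: 3},                    # b
    {d: 8},                          # c
    {e: 7},                          # d
    {f: 5},                          # e
    {c: 2, g: 2, h: 2},              # f
    {f: 1, h: 6},                    # g
    {f: 9, g: 8},                    # h
]

is_neighbor = b in N[a]  # membership test on the dict keys
out_degree_f = len(N[f])  # out-degree of f
weight_ab = N[a][b]       # weight of the edge a -> b
```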

With this representation, weighted edges are no problem: the dict value stores the weight.

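Besides these three, you can also change the outer data structure. All three above use a list on the outside, but a dict works too, in which case nodes can be labeled with letters. In practice, choose the representation that best fits the problem.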
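A sketch of the dict-based outer structure, again on the assumed eight-node sample graph. Nodes are one-letter strings, and set('bcdef') builds a set of those letters.

```python
# Outer dict keyed by node label; each value is the set of neighbor labels.
N = {
    'a': set('bcdef'),
    'b': set('ce'),
    'c': set('d'),
    'd': set('e'),
    'e': set('f'),
    'f': set('cgh'),
    'g': set('fh'),
    'h': set('fg'),
}

is_neighbor = 'b' in N['a']  # constant time on average
out_degree_f = len(N['f'])
```

With a dict on the outside, nodes no longer have to be the integers 0..n-1; any hashable object can serve as a node.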

Adjacency matrix

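Use nested lists, with 1 and 0 indicating whether two nodes are connected. Testing whether two nodes are connected then takes constant time, but computing a node's degree takes linear time.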
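A sketch of the adjacency matrix for the same assumed eight-node sample graph. Row u, column v is 1 when there is an edge from u to v.

```python
# Assumed sample graph: eight nodes numbered 0..7, named a..h for readability.
a, b, c, d, e, f, g, h = range(8)

#     a  b  c  d  e  f  g  h
N = [[0, 1, 1, 1, 1, 1, 0, 0],  # a
     [0, 0, 1, 0, 1, 0, 0, 0],  # b
     [0, 0, 0, 1, 0, 0, 0, 0],  # c
     [0, 0, 0, 0, 1, 0, 0, 0],  # d
     [0, 0, 0, 0, 0, 1, 0, 0],  # e
     [0, 0, 1, 0, 0, 0, 1, 1],  # f
     [0, 0, 0, 0, 0, 1, 0, 1],  # g
     [0, 0, 0, 0, 0, 1, 1, 0]]  # h

is_neighbor = N[a][b] == 1  # connectivity test: a single index lookup
out_degree_f = sum(N[f])     # out-degree of f: must scan the whole row
```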

If the edges are weighted, you can store the weight in place of 1 and inf in place of 0.
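A small sketch of the weighted variant. To keep it short, it assumes a hypothetical four-node graph with made-up weights rather than the larger sample graph.

```python
inf = float("inf")

# Hypothetical four-node directed graph.
a, b, c, d = range(4)

#      a    b    c    d
W = [[inf,   2,   1, inf],  # a
     [inf, inf,   4,   3],  # b
     [inf, inf, inf,   8],  # c
     [inf, inf, inf, inf]]  # d

has_edge_ab = W[a][b] < inf  # an edge exists when the weight is finite
has_edge_ca = W[c][a] < inf
out_degree_a = sum(1 for w in W[a] if w < inf)  # count the finite entries in the row
```

Using inf for missing edges fits shortest-path computations, where a missing edge should behave like an infinitely long one.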

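NumPy: the author mentions NumPy here, one of the most widely used numerical computing modules, which contains many functions for multidimensional array computation. I will probably study it in detail later for machine learning, and may write a post about it then.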

(2) Representing trees [assume the tree to be represented is the following]

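A tree is a special kind of graph, so any graph representation works, but because trees are special there are better options. The simplest is to use a nested list directly; the obvious drawback is poor readability, since the structure is not at all intuitive.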
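A sketch of the nested-list idea. The tree itself is an assumption, since the figure is not reproduced in the text: each inner list is a subtree and each string is a leaf.

```python
# Hypothetical tree: the root has three subtrees.
T = [["a", "b"], ["c"], ["d", ["e", "f"]]]

first_subtree_second_leaf = T[0][1]  # second leaf of the first subtree
deep_leaf = T[2][1][0]               # first leaf of the nested subtree under the third child
```

The index chains show the readability problem: nothing in T[2][1][0] says which child is meant.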

Often we know the maximum number of children a node can have (binary trees, ternary trees, and so on), so a convenient implementation is to use a class.
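A minimal sketch of a binary tree class, assuming leaves are stored directly as values. The class name Tree and its attribute names are illustrative.

```python
class Tree:
    """A binary tree node with exactly two named children."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

# Hypothetical tree with four leaves.
t = Tree(Tree("a", "b"), Tree("c", "d"))
leaf = t.right.left
```

Named attributes such as t.right.left read more clearly than index chains into nested lists.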

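In the implementation above, every attribute refers to a child. Another very common representation is “left child, right sibling” (also known as first child, next sibling), which suits trees where the number of children is not known in advance.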
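A sketch of the first child, next sibling idea. The names Node, first_child, next_sibling, and children are hypothetical. Each node points to its first child and to its next sibling, so a node can have any number of children.

```python
class Node:
    """A tree node that links to its first child and its next sibling."""

    def __init__(self, value, first_child=None, next_sibling=None):
        self.value = value
        self.first_child = first_child
        self.next_sibling = next_sibling

def children(node):
    """Yield the children of node by following the sibling chain."""
    child = node.first_child
    while child is not None:
        yield child
        child = child.next_sibling

# Hypothetical tree: a root with three children a, b, c.
c = Node("c")
b = Node("b", next_sibling=c)
a = Node("a", next_sibling=b)
root = Node("root", first_child=a)

child_values = [n.value for n in children(root)]
```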

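[Bunch Pattern]: Interestingly, implementations like those above can make use of a common Python idiom called the Bunch pattern, which apparently comes from the classic Python Cookbook. The book describes it as follows: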

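[This is hard to understand and translate, and the original reads better, so it is quoted as is; I may cover it in detail once I understand it thoroughly.]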

When prototyping (or even finalizing) data structures such as trees, it can be useful to have a flexible class that will allow you to specify arbitrary attributes in the constructor. In these cases, the “Bunch” pattern (named by Alex Martelli in the Python Cookbook) can come in handy. There are many ways of implementing it, but the gist of it is the following:

There are several useful aspects to this pattern. First, it lets you create and set arbitrary attributes by supplying them as keyword arguments:

Second, by subclassing dict, you get lots of functionality for free, such as iterating over the keys/attributes or easily checking whether an attribute is present. Here’s an example:

This pattern isn’t useful only when building trees, of course. You could use it for any situation where you’d want a flexible object whose attributes you could set in the constructor.
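A minimal sketch of the pattern described above, in one common form: a dict subclass whose keys are also attributes. The sample values are illustrative.

```python
class Bunch(dict):
    """A dict whose keys are also accessible as attributes."""

    def __init__(self, *args, **kwds):
        super().__init__(*args, **kwds)
        # Share storage: attribute access and item access see the same data.
        self.__dict__ = self

# Arbitrary attributes set through keyword arguments in the constructor.
x = Bunch(name="Jayne Cobb", position="Public Relations")
print(x.name)

# Building a small tree, then using both attribute and dict-style access.
T = Bunch
t = T(left=T(left="a", right="b"), right=T(left="c"))
print(t.left.right, t["left"]["right"])
print("left" in t.right, "right" in t.right)
```

Because Bunch is a dict, membership tests and iteration over attribute names come for free, which is what the quoted passage refers to.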


[Python modules related to graphs]

• NetworkX: http://networkx.lanl.gov

• python-graph: http://code.google.com/p/python-graph

• Graphine: http://gitorious.org/projects/graphine/pages/Home

• Pygr: a graph database http://bioinfo.mbi.ucla.edu/pygr

• Gato: a graph animation toolbox http://gato.sourceforge.net

• PADS: a collection of graph algorithms http://www.ics.uci.edu/~eppstein/PADS


5. Some details of Python programming

In general, the more important your program, the more you should mistrust such black boxes and seek to find out what’s going on under the cover.

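The author's point here is that the more important your program is, the better you need to understand the internals of the data structures you use, and sometimes you even need to reimplement them yourself.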

(1) Hidden Squares: hidden quadratic running time

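Sometimes we don't notice that an operation is very inefficient. For example, testing whether an element is in a list takes linear time, whereas testing membership in a set takes constant time on average. So if you need to test many elements for membership, a set performs much better: doing n membership tests against a list of n elements is quadratic overall.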
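A sketch of the difference, assuming 1000 random values and 1000 random queries. The variable names are hypothetical. Both versions compute the same count; only the cost per membership test differs.

```python
from random import randrange

values = [randrange(10000) for _ in range(1000)]
queries = [randrange(10000) for _ in range(1000)]

# Each "q in values" scans the list, so the whole loop is quadratic.
hits_list = sum(1 for q in queries if q in values)

# Building the set costs linear time once; each lookup is then constant on average.
value_set = set(values)
hits_set = sum(1 for q in queries if q in value_set)

print(hits_list, hits_set)
```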

(2) The Trouble with Floats: problems with precision

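Binary floating-point numbers cannot represent most decimal fractions exactly! [See a book on computer organization for how floating-point systems work.] In Python, floats can cause a lot of trouble: for example, a comparison of two expressions that should be mathematically equal can return False.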

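Never rely on exact equality of floats to decide whether two values are equal! At best you can check that two floats agree to a limited number of digits, that is, that they are approximately equal.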
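A small sketch of the problem and of an approximate comparison using math.isclose from the standard library. The tolerance value is an arbitrary choice for illustration.

```python
import math

total = 0.1 + 0.2

exact_equal = (total == 0.3)                          # False: total is 0.30000000000000004
close_enough = math.isclose(total, 0.3, rel_tol=1e-9)  # True: equal within a relative tolerance

print(total, exact_equal, close_enough)
```

When comparing against zero, pass abs_tol as well, since a relative tolerance alone never treats a nonzero value as close to 0.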

Beyond that, some modules can help, for example decimal from the standard library, which is very useful when handling financial data.
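A minimal sketch of decimal arithmetic. It assumes the values are created from strings, because creating a Decimal from a float carries over the float's representation error.

```python
from decimal import Decimal

# Built from strings, these are exact decimal values.
total = Decimal("0.1") + Decimal("0.2")
exact_equal = (total == Decimal("0.3"))

# Built from a float, the Decimal inherits the binary rounding error.
from_float_differs = (Decimal(0.1) != Decimal("0.1"))

print(total, exact_equal, from_float_differs)
```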

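Another useful tool is Sage, which can do symbolic mathematics to obtain exact values and can also produce approximate floating-point results when needed. See the official Sage website.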

For more on floating-point numbers in Python, see Floating Point Arithmetic: Issues and Limitations.

Problem 2-12. (Graph representation)

Consider the following graph representation: you use a dictionary and let each key be a pair (tuple) of two nodes, with the corresponding value set to the edge weight. For example W[u, v] = 42. What would be the advantages and disadvantages of this representation? Could you supplement it to mitigate the downsides?

The advantages and disadvantages depend on what you’re using it for. It works well for looking up edge weights efficiently but less well for iterating over the graph’s nodes or a node’s neighbors, for example. You could improve that part by using some extra structures (for example, a global list of nodes, if that’s what you need or a simple adjacency list structure, if that’s required).
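One possible way to supplement the edge dictionary, sketched under the assumption that we also need to iterate over nodes and neighbors. The names neighbors and add_edge are hypothetical.

```python
W = {}          # (u, v) -> edge weight, as in the problem statement
neighbors = {}  # hypothetical supplementary index: node -> set of its successors

def add_edge(u, v, weight):
    """Record the edge weight and keep the neighbor index in sync."""
    W[u, v] = weight
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set())  # make sure v is registered as a node

add_edge("u", "v", 42)
add_edge("u", "w", 7)
add_edge("v", "w", 1)

weight_uv = W["u", "v"]        # fast edge-weight lookup
nodes = sorted(neighbors)       # iterate over all nodes
succ_u = sorted(neighbors["u"])  # iterate over u's neighbors
```

The cost is extra memory and the need to update both structures together, which is why the sketch routes all changes through add_edge.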
