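Google's PageRank Algorithm Explained

Published on segmentfault, 2014-10-08

The PageRank Algorithm

PageRank was once the peerless weapon that set Google apart from every other search engine. Larry Page and Sergey Brin invented the algorithm while they were graduate students at Stanford University. Paper: The PageRank Citation Ranking: Bringing Order to the Web

This article first frames the problem using several references, then presents several ways to compute PageRank, and finally shows how to implement PageRank under the MapReduce framework.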

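PageRank rests on two core ideas:

1. If many other pages link to a page, that page is relatively important, so its pagerank value is relatively high;

2. If a page with a high pagerank value links to another page, the pagerank value of the linked page rises accordingly.

The figure below comes from Wikipedia. Each ball represents a web page, and the size of the ball reflects that page's pagerank value. Many links point to pages B and E, so B and E have high pagerank values. Few pages point to C, but the most important page, B, links to C, so C's pagerank value is even larger than E's.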

image

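References:

1. Wiki about PageRank

2. The Secret of Google: A Thorough Explanation of PageRank (Chinese edition)

3. Numerical Analysis and Algorithms, page 161, application example: Google's PageRank algorithm

4. Numeric Methods with Matlab, or its Chinese translation, Matlab數值計算

5. Computing PageRank with the MapReduce approach, page 62, PageRank and Markov chains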

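1. Problem background

From Reference 3.

image

2. Mathematical model

From Reference 3. The key notions are the web connectivity matrix $G$, the Markov process ("surfing the web"), the transition matrix $A$, and the probability $p$ that a user follows one of the links on the current page (usually taken to be 0.85).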

image
image

This finally yields the equation $Ax=x$, which amounts to finding the eigenvector of $A$ associated with the eigenvalue 1!
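Written out element by element, the model behind this equation can be sketched as follows. The sketch follows the convention used by the code later in this article, where $g_{ij}=1$ when page $j$ links to page $i$; the symbol $c_j$ for the out-degree of page $j$ is introduced here for illustration.

$$c_j=\sum_{i} g_{ij},\qquad a_{ij}=\begin{cases} p\,\dfrac{g_{ij}}{c_j}+\delta, & c_j\neq 0\\[4pt] \dfrac{1}{n}, & c_j=0\end{cases},\qquad \delta=\frac{1-p}{n}$$

For a page with outgoing links, the column sum is $p+n\delta=1$; for a page without any, it is $n\cdot\frac{1}{n}=1$. So $A$ is the transition matrix of a Markov chain, and the pagerank vector is the solution of $Ax=x$ scaled so that $\sum_i x_i=1$.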

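The material below uses the Gershgorin circle theorem to explain why 1 is the dominant eigenvalue of $A$, so the power method can be used to compute the solution.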
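The circle-theorem argument can be sketched in a few lines, assuming only that $A$ has nonnegative entries and that each of its columns sums to 1 (the all-ones vector $e$ is notation introduced here). Applying the Gershgorin theorem to $A^T$, which has the same eigenvalues as $A$, every eigenvalue $\lambda$ lies in at least one disc centered at some diagonal entry $a_{jj}$:

$$|\lambda-a_{jj}|\le\sum_{i\ne j}a_{ij}=1-a_{jj}\quad\Longrightarrow\quad|\lambda|\le a_{jj}+(1-a_{jj})=1$$

Since $e^TA=e^T$, the value 1 is itself an eigenvalue, so no eigenvalue exceeds it in modulus. That 1 is the only eigenvalue of modulus 1 needs more than the disc bound: it follows from the Perron-Frobenius theorem when every entry of $A$ is strictly positive, which holds for the usual construction with $p<1$.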

For a detailed introduction to the power method, see another article, Numerical Methods Using Matlab: Chapter 3, Computing Matrix Eigenvalues and Singular Values.

image
image

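3. Computing PageRank

Suppose the web pages are linked as in the model on the right side of the figure above.

(1) Power method

The wiki describes a simplified PageRank algorithm that ignores the transition probability and works iteratively. Every round updates the pagerank values of all pages: each page splits its pagerank value evenly among the pages it links to, and each page's value for the round is the sum of the shares it receives from the pages linking to it. Iteration stops once all pagerank values have converged or a threshold condition is met.

The MapReduce implementation of PageRank later in this article uses this idea. The version with a transition probability is similar: multiply the summed shares by the transition probability, then add the probability of a random jump.

image

Following this idea, the Matlab code below computes the PageRank value of every page.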

n=6;
i=[2 3 4 4 5 6 1 6 1];
j=[1 2 2 3 3 3 4 5 6];
G=sparse(i,j,1,n,n);

% Power method
for j = 1:n
   L{j} = find(G(:,j));
   c(j) = length(L{j});
end

p = .85;
delta = (1-p)/n;
x = ones(n,1)/n;
z = zeros(n,1);
cnt = 0;
while max(abs(x-z)) > .0001
   z = x;
   x = zeros(n,1);
   for j = 1:n
      if c(j) == 0
         x = x + z(j)/n;% page with no outlinks: jump to any page with equal probability
      else
         x(L{j}) = x(L{j}) + z(j)/c(j);% split the previous pagerank value evenly among the linked pages
      end
   end
   x = p*x + delta;
   cnt = cnt+1;
end

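The resulting vector $x$ holds the pagerank value of every page. Pages ①, ④ and ⑥ each have two inbound links, yet page ① ranks higher than both ④ and ⑥. Page ② has the second-highest pagerank value because page ① links to it, so it benefits from page ①'s rank.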

x =
    0.2675
    0.2524
    0.1323
    0.1698
    0.0625
    0.1156
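
For readers without Matlab, the same iteration can be sketched in plain Python. This is only an illustration, not code from any of the references: the function name pagerank_power and the dict-of-lists graph representation are my own choices, and the graph is the six-page example above, where links[j] lists the pages that page j links to.

```python
def pagerank_power(links, p=0.85, tol=1e-4, max_iter=1000):
    """Power iteration; links[j] lists the pages that page j links to."""
    pages = sorted(links)
    n = len(pages)
    delta = (1.0 - p) / n
    x = {page: 1.0 / n for page in pages}
    for _ in range(max_iter):
        z = x
        x = {page: 0.0 for page in pages}
        for page in pages:
            outs = links[page]
            if not outs:
                # Page without outlinks: spread its rank over every page.
                for target in pages:
                    x[target] += z[page] / n
            else:
                # Split the previous rank evenly among the linked pages.
                for target in outs:
                    x[target] += z[page] / len(outs)
        # Follow a link with probability p, jump at random otherwise.
        x = {page: p * x[page] + delta for page in pages}
        if max(abs(x[q] - z[q]) for q in pages) <= tol:
            break
    return x


# Same edges as the sparse matrix G in the Matlab listing.
links = {1: [2], 2: [3, 4], 3: [4, 5, 6], 4: [1], 5: [6], 6: [1]}
ranks = pagerank_power(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

The values should agree with the Matlab output above to roughly three decimal places; the last digit depends on where the tolerance stops the loop.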

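A separate blog post gives a Python implementation of this algorithm. Its author uses the third-party module python-graph, which implements many graph algorithms (see the module's usage examples). The module must be installed before use: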

easy_install python-graph-core
easy_install python-graph-dot

Python implementation of the algorithm:

# coding=utf-8

# python-graph https://code.google.com/p/python-graph/

# Import graphviz
import graphviz as gv

# Import pygraph
from pygraph.classes.digraph import digraph
from pygraph.readwrite.dot import write

# Define pagerank function
def pagerank(graph, damping_factor=0.85, max_iterations=100, \
             min_delta=0.00001):
    """
    Compute and return the PageRank in a directed graph.

    @type  graph: digraph
    @param graph: Digraph.

    @type  damping_factor: number
    @param damping_factor: PageRank damping factor.

    @type  max_iterations: number
    @param max_iterations: Maximum number of iterations.

    @type  min_delta: number
    @param min_delta: Smallest variation required for a new iteration.

    @rtype:  Dict
    @return: Dict containing all the nodes PageRank.
    """

    nodes = graph.nodes()
    graph_size = len(nodes)
    if graph_size == 0:
        return {}
    # value for nodes without inbound links
    min_value = (1.0-damping_factor)/graph_size

    # initialize the page rank dict with 1.0 for all nodes
    # (the commented-out alternative starts from 1/N instead)
    #pagerank = dict.fromkeys(nodes, 1.0/graph_size)
    pagerank = dict.fromkeys(nodes, 1.0)

    for i in range(max_iterations):
        diff = 0 #total difference compared to last iteration
        # computes each node PageRank based on inbound links
        for node in nodes:
            rank = min_value
            for referring_page in graph.incidents(node):
                rank += damping_factor * pagerank[referring_page] / \
                        len(graph.neighbors(referring_page))

            diff += abs(pagerank[node] - rank)
            pagerank[node] = rank

        print('This is NO.%s iteration' % (i+1))
        print(pagerank)
        print('')

        #stop if PageRank has converged
        if diff < min_delta:
            break

    return pagerank

# Graph creation
gr = digraph()

# Add nodes and edges: the same six-page graph as the Matlab example,
# where add_edge((a, b)) means page a links to page b
gr.add_nodes(["1","2","3","4","5","6"])

gr.add_edge(("1","2"))
gr.add_edge(("2","3"))
gr.add_edge(("2","4"))
gr.add_edge(("3","4"))
gr.add_edge(("3","5"))
gr.add_edge(("3","6"))
gr.add_edge(("4","1"))
gr.add_edge(("5","6"))
gr.add_edge(("6","1"))

# Draw as PNG
# dot = write(gr)
# gvv = gv.readstring(dot)
# gv.layout(gvv,'dot')
# gv.render(gvv,'png','Model.png')

pagerank(gr)

After 32 iterations the result is as follows, matching the earlier result:

This is NO.32 iteration
{'1': 0.2675338708706491, '3': 0.13227261904986046, '2': 0.2524037902400518, '5': 0.062477242064127136, '4': 0.1697488529161491, '6': 0.1155828978186352}

(2) Exploiting the special structure of the Markov matrix

From Reference 4, where $\delta=\frac{1-p}{n}$.

image

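In other words, the matrix $A$ is decomposed so that it never has to be formed explicitly; what remains is simply solving a system of linear equations.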
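
The step from $Ax=x$ to a linear system can be sketched as follows, using the names that appear in the Matlab code: $D$ is the diagonal matrix with $d_{jj}=1/c_j$ for pages with out-degree $c_j\neq 0$ (and 0 otherwise), and $e$ is the vector of all ones. The vector $z$, with $z_j=\delta$ if $c_j\neq 0$ and $z_j=1/n$ otherwise, is notation assumed here.

$$A=pGD+ez^T,\qquad x=Ax=pGDx+e\,(z^Tx)\quad\Longrightarrow\quad (I-pGD)\,x=\gamma e,\qquad \gamma=z^Tx$$

The scalar $\gamma$ only rescales $x$, so it is enough to solve $(I-pGD)x=e$ and then divide $x$ by $\sum_i x_i$. The matrix $I-pGD$ stays sparse, whereas $A$ itself is full.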

function x = pagerank1(G)
% PAGERANK1  Google's PageRank modified version 1 - hujiawei

%if nargin < 3, p = .85; end
p=0.85;

% Eliminate any self-referential links

G = G - diag(diag(G));

% c = out-degree, r = in-degree

[n,n] = size(G);
c = sum(G,1);%column sums: out-degree of each page
r = sum(G,2);%row sums: in-degree of each page

% Scale column sums to be 1 (or 0 where there are no out links).

k = find(c~=0);
D = sparse(k,k,1./c(k),n,n);

% Solve (I - p*G*D)*x = e

e = ones(n,1);
I = speye(n,n);
x = (I - p*G*D)\e;

% Normalize so that sum(x) == 1.

x = x/sum(x);
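
The same linear-system approach can be sketched without Matlab. The Python below is an illustration under stated assumptions rather than a port of pagerank1: the helper names solve and pagerank_linear are hypothetical, pages are numbered 1..n, the graph has no self-links, and a small dense Gaussian elimination stands in for Matlab's sparse backslash operator.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def pagerank_linear(links, p=0.85):
    """links[j] lists the pages that page j links to (pages numbered 1..n)."""
    n = len(links)
    # Build I - p*G*D directly: entry (i, j) is -p/c_j when page j links to page i.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j, outs in links.items():
        for i in outs:
            m[i - 1][j - 1] -= p / len(outs)
    x = solve(m, [1.0] * n)
    total = sum(x)
    return [v / total for v in x]  # normalize so the ranks sum to 1


# Six-page example graph from the power-method section.
links = {1: [2], 2: [3, 4], 3: [4, 5, 6], 4: [1], 5: [6], 6: [1]}
x = pagerank_linear(links)
print([round(v, 4) for v in x])
```

Because this solves the system directly, the result does not depend on an iteration tolerance and can serve as a cross-check for the power method.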

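(3) A clever approach: inverse iteration

This approach exploits rounding error in Matlab: $I-A$ is singular in exact arithmetic, but roundoff makes the computed matrix nonsingular. Matlab only emits some warnings at run time, and the result matches that of the other algorithms.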

image

function x = pagerank2(G)
% PAGERANK2  Google's PageRank modified version 2 - hujiawei
% using inverse iteration method

%if nargin < 3, p = .85; end
p=0.85;

% Eliminate any self-referential links

G = G - diag(diag(G));

% c = out-degree, r = in-degree

[n,n] = size(G);
c = sum(G,1);%column sums: out-degree of each page
r = sum(G,2);%row sums: in-degree of each page

% Scale column sums to be 1 (or 0 where there are no out links).

k = find(c~=0);
D = sparse(k,k,1./c(k),n,n);

% Solve (I - A)*x = e with A = p*G*D + delta (inverse iteration)

e = ones(n,1);
I = speye(n,n);
% x = (I - p*G*D)\e;
delta=(1-p)/n;
A=p*G*D+delta;
x=(I-A)\e;

% Normalize so that sum(x) == 1.

x = x/sum(x);

Finally, here is a nice piece of code from Reference 4 that simulates random surfing to generate the matrix $G$:

function [U,G] = surfer(root,n)
% SURFER  Create the adjacency graph of a portion of the Web.
%    [U,G] = surfer(root,n) starts at the URL root and follows
%    Web links until it forms an adjacency graph with n nodes.
%    U = a cell array of n strings, the URLs of the nodes.
%    G = an n-by-n sparse matrix with G(i,j)=1 if node j is linked to node i.
%
%    Example:  [U,G] = surfer('http://www.harvard.edu',500);
%    See also PAGERANK.
%
%    This function currently has two defects.  (1) The algorithm for
%    finding links is naive.  We just look for the string 'http:'.
%    (2) An attempt to read from a URL that is accessible, but very slow,
%    might take an unacceptably long time to complete.  In some cases,
%    it may be necessary to have the operating system terminate MATLAB.
%    Key words from such URLs can be added to the skip list in surfer.m.

% Initialize

clf
shg
set(gcf,'doublebuffer','on')
axis([0 n 0 n])
axis square
axis ij
box on
set(gca,'position',[.12 .20 .78 .78])
uicontrol('style','frame','units','normal','position',[.01 .09 .98 .07]);
uicontrol('style','frame','units','normal','position',[.01 .01 .98 .07]);
t1 = uicontrol('style','text','units','normal','position',[.02 .10 .94 .04], ...
   'horiz','left');
t2 = uicontrol('style','text','units','normal','position',[.02 .02 .94 .04], ...
   'horiz','left');
slow = uicontrol('style','toggle','units','normal', ...
   'position',[.01 .24 .07 .05],'string','slow','value',0);
quit = uicontrol('style','toggle','units','normal', ...
   'position',[.01 .17 .07 .05],'string','quit','value',0);

U = cell(n,1);
hash = zeros(n,1);
G = logical(sparse(n,n));
m = 1;
U{m} = root;
hash(m) = hashfun(root);

j = 1;
while j < n & get(quit,'value') == 0

   % Try to open a page.

   try
      set(t1,'string',sprintf('%5d %s',j,U{j}))
      set(t2,'string','');
      drawnow
      page = urlread(U{j});
   catch
      set(t1,'string',sprintf('fail: %5d %s',j,U{j}))
      drawnow
      j = j+1;   % move on; otherwise the loop retries the same unreadable page forever
      continue
   end
   if get(slow,'value')
      pause(.25)
   end

   % Follow the links from the open page.

   for f = findstr('http:',page);

      % A link starts with 'http:' and ends with the next quote.

      e = min([findstr('"',page(f:end)) findstr('''',page(f:end))]);
      if isempty(e), continue, end
      url = deblank(page(f:f+e-2));
      url(url<' ') = '!';   % Nonprintable characters
      if url(end) == '/', url(end) = []; end

      % Look for links that should be skipped.

      skips = {'.gif','.jpg','.pdf','.css','lmscadsi','cybernet', ...
               'search.cgi','.ram','www.w3.org', ...
               'scripts','netscape','shockwave','webex','fansonly'};
      skip = any(url=='!') | any(url=='?');
      k = 0;
      while ~skip & (k < length(skips))
         k = k+1;
         skip = ~isempty(findstr(url,skips{k}));
      end
      if skip
         if isempty(findstr(url,'.gif')) & isempty(findstr(url,'.jpg'))
            set(t2,'string',sprintf('skip: %s',url))
            drawnow
            if get(slow,'value')
               pause(.25)
            end
         end
         continue
      end

      % Check if page is already in url list.

      i = 0;
      for k = find(hash(1:m) == hashfun(url))';
         if isequal(U{k},url)
            i = k;
            break
         end
      end

      % Add a new url to the graph if there are fewer than n.

      if (i == 0) & (m < n)
         m = m+1;
         U{m} = url;
         hash(m) = hashfun(url);
         i = m;
      end

      % Add a new link.

      if i > 0
         G(i,j) = 1;
         set(t2,'string',sprintf('%5d %s',i,url))
         line(j,i,'marker','.','markersize',6)
         drawnow
         if get(slow,'value')
            pause(.25)
         end
      end
   end

   j = j+1;
end
delete(t1)
delete(t2)
delete(slow)
set(quit,'string','close','callback','close(gcf)','value',0)

%------------------------

function h = hashfun(url)
% Almost unique numeric hash code for pages already visited.
h = length(url) + 1024*sum(url);
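
The comment "almost unique" is worth a small illustration. Below is a Python analogue of hashfun (a sketch; the two URLs are made-up examples): because the hash depends only on the length and the sum of the character codes, any two URLs that are rearrangements of the same characters collide.

```python
def hashfun(url):
    """Python analogue of the Matlab hashfun: length + 1024 * sum of character codes."""
    return len(url) + 1024 * sum(ord(ch) for ch in url)


a = "http://example.com/ab"  # hypothetical URLs, same characters in a different order
b = "http://example.com/ba"
print(hashfun(a) == hashfun(b), a == b)  # prints: True False
```

This is why surfer uses the hash only to narrow down the candidates and then confirms a match with isequal(U{k},url).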

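4. Implementing PageRank under the MapReduce framework

Using the iterative (power method) idea from the wiki described earlier, implementing PageRank under MapReduce is straightforward. It helps to read Reference 5 first.

The article using-mapreduce-to-compute-pagerank goes into more detail and is worth consulting.

What follows is one of my big data assignments: implement PageRank under the MapReduce framework, following the simplified algorithm from the wiki. The data set provided describes relationships between Twitter users, which can be treated like links between web pages. The teaching assistant did not require us to write code or run it on this data set (it is over 1 GB), so what follows is only a workable idealized version in Python that has not been validated on a real large data set. Also, I am not yet familiar with some functions of Python's mapreduce frameworks, so the implementation is a concise, testable PageRank algorithm.

1. Input and output format

The input to the map function is <node, list of edges leaving that node>, where a node is an object that carries its current pagerank value. The output is <node, pagerank value of the linking node / number of edges leaving the linking node>;

The input to the reduce function is <node, pagerank value of the linking node / number of edges leaving the linking node>, and the output is <node, list of edges leaving that node>, where the node now carries its updated pagerank value.

Pseudocode: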

process the data to the form of {node i:[its adjacent node list],...}
while the sum of differences between the last two pagerank values >= threshold
    map({node i:[its adjacent node list],...}):
        map_output={}
        for every node j in adjacent node list:
            put or sum up {j:(i, PageRank(i)/length(adjacent node list))} into map_output
        return map_output

    reduce(map_output):
        reduce_output={}
        for every entry {j:(i, PageRank(i)/length(adjacent node list))} in map_output:
            sum up all the pagerank shares received by node j and put the total, together with j's adjacent node list, into reduce_output
        return reduce_output

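2. Worked example

Suppose users 1, 2, 3 and 4 are related as shown in the figure below:

image

Suppose there are 2 mappers (A and B) and 1 reducer (C), and all 4 nodes start with a pagerank value of 0.25.

The data for users 1 and 2 is read and processed by mapper A, and the data for users 3 and 4 by mapper B [I verified that even when a single user's data is read by different mappers, the final converged result is nearly the same]

The map inputs and outputs are as follows:

image

The reduce inputs and outputs are as follows. The input is the output of the 2 mappers, and the output carries the updated pagerank values of the nodes.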

image
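
One round of this example can be reproduced with exact fractions. The sketch below assumes the link structure used by the test code later in this article (1 links to 2, 3, 4; 2 links to 3, 4; 3 links to 1, 4; 4 links to 2); the helper names mapper and reducer are illustrative only.

```python
from fractions import Fraction

# links[i] lists the users that user i points to.
links = {1: [2, 3, 4], 2: [3, 4], 3: [1, 4], 4: [2]}
rank = {node: Fraction(1, 4) for node in links}


def mapper(nodes):
    """Emit, per target, the summed shares sent by this mapper's nodes."""
    out = {}
    for node in nodes:
        share = rank[node] / len(links[node])
        for target in links[node]:
            out[target] = out.get(target, 0) + share
    return out


def reducer(outputs):
    """Add up the shares from all mappers to get the new rank of each node."""
    new_rank = {node: Fraction(0) for node in links}
    for out in outputs:
        for node, value in out.items():
            new_rank[node] += value
    return new_rank


out_a = mapper([1, 2])   # mapper A handles users 1 and 2
out_b = mapper([3, 4])   # mapper B handles users 3 and 4
new_rank = reducer([out_a, out_b])
for node in sorted(new_rank):
    print(node, new_rank[node])
```

After this first round the ranks are 1/8, 1/3, 5/24 and 1/3, which still sum to 1 because every node has at least one outgoing edge.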

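After the reducer finishes, its result is fed back into the mappers. This repeats until the number of iterations exceeds a preset limit, or until the sum of the differences between the pagerank values of all nodes in two consecutive iterations (the 2-norm can be used instead) falls below a preset threshold.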
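
The two stopping measures mentioned here can be sketched as follows. The function names l1_diff and l2_diff are hypothetical, and the sample values are the initial ranks and the ranks after one round of the four-node example, worked out by hand.

```python
import math


def l1_diff(old, new):
    """Sum of absolute differences between two rank dicts."""
    return sum(abs(new[k] - old[k]) for k in new)


def l2_diff(old, new):
    """Euclidean (2-norm) distance between two rank dicts."""
    return math.sqrt(sum((new[k] - old[k]) ** 2 for k in new))


old = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
new = {1: 1 / 8, 2: 1 / 3, 3: 5 / 24, 4: 1 / 3}  # ranks after one round
print(l1_diff(old, new), l2_diff(old, new))
```

The 2-norm is never larger than the sum of absolute differences, so with the same threshold it stops no later than the sum-based test.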

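3. Experimental results for the example

(1) First, Matlab's power method computes the result for the example with p=1.0 [its main purpose is to verify the correctness of the Python version that follows]

The Matlab source code is as follows: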

n=4;
i=[2 3 4 3 4 4 1 2];
j=[1 1 1 2 2 3 3 4];
G=sparse(i,j,1,n,n);

[n,n] = size(G);
for j = 1:n
   L{j} = find(G(:,j));
   c(j) = length(L{j});
end

% Power method
p=1.0;
delta = (1-p)/n;
x = ones(n,1)/n;
z = zeros(n,1);
cnt = 0;
while max(abs(x-z)) > .0001
   z = x;
   x = zeros(n,1);
   for j = 1:n
      if c(j) == 0
         x = x + z(j)/n;
      else
         x(L{j}) = x(L{j}) + z(j)/c(j);
      end
   end
   x = p*x + delta;
   cnt = cnt+1;
end
sprintf('pagerank result:')
x

The result is:

0.1072
0.3571
0.2143
0.3214
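
As a cross-check, with p=1.0 the balance equations of this four-node graph can be solved by hand, giving x = (3, 10, 6, 9)/28. The sketch below (variable names are my own) verifies with exact fractions that this vector is unchanged by one more round of rank sharing.

```python
from fractions import Fraction

# links[i] lists the nodes that node i points to (same graph as G above).
links = {1: [2, 3, 4], 2: [3, 4], 3: [1, 4], 4: [2]}
x = {1: Fraction(3, 28), 2: Fraction(10, 28), 3: Fraction(6, 28), 4: Fraction(9, 28)}

# One round with p = 1: every node splits its rank evenly among its targets.
nxt = {node: Fraction(0) for node in links}
for node, outs in links.items():
    for target in outs:
        nxt[target] += x[node] / len(outs)

print(nxt == x)
print([round(float(x[node]), 4) for node in sorted(x)])
```

The exact values agree with the Matlab output to about three decimal places; the small difference in the last digit of the first entry is presumably due to the loop stopping at the 0.0001 tolerance.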

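(2) The Matlab version of pagerank does not iterate in the mapreduce style, so I also wrote a Python version of pagerank that follows the mapreduce idea (note: I did not use Python's map and reduce functions, but a more easily understood implementation). It uses a threshold of 0.0001 and at most 100 iterations.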

# coding=utf-8

__author__ = 'hujiawei'
__doc__ = 'pagerank mapreduce'

class Node:
    def __init__(self,id,pk):
        self.id=id
        self.pk=pk

def pk_map(map_input):
    map_output={}
    for node,outlinks in map_input.items():
        for link in outlinks:
            size=len(outlinks)
            if link in map_output:
                map_output[link]+=(float)(node.pk)/size
            else:
                map_output[link]=(float)(node.pk)/size
    return map_output

def pk_reduce(reduce_input):
    for result in reduce_input:
        for node,value in result.items():
            node.pk+=value

def pk_clear(nodes):
    for node in nodes:
        node.pk=0

def pk_last(nodes):
    lastnodes=[]
    for node in nodes:
        lastnodes.append(Node(node.id,node.pk))
    return lastnodes

def pk_diff(nodes,lastnodes):
    diff=0
    for i in range(len(nodes)):
        print('node pk %f, last node pk %f ' % (nodes[i].pk, lastnodes[i].pk))
        diff+=abs(nodes[i].pk-lastnodes[i].pk)
    return diff

def pk_test1():
    node1 = Node(1, 0.25)
    node2 = Node(2, 0.25)
    node3 = Node(3, 0.25)
    node4 = Node(4, 0.25)
    nodes = [node1, node2, node3, node4]
    threshold = 0.0001
    max_iters = 100

    for iter_count in range(max_iters):
        iter_count += 1
        lastnodes=pk_last(nodes)
        print('============ map count %d =================' % (iter_count))
        in1 = {node1: [node2, node3, node4], node2: [node3, node4]}
        in2 = {node3: [node1, node4], node4: [node2]}

        mapout1 = pk_map(in1)
        mapout2 = pk_map(in2)

        for node, value in mapout1.items():
            print(str(node.id) + ' ' + str(value))

        for node, value in mapout2.items():
            print(str(node.id) + ' ' + str(value))

        print('============ reduce count %d =================' % (iter_count))

        reducein = [mapout1, mapout2]
        pk_clear(nodes)
        pk_reduce(reducein)

        for node in nodes:
            print(str(node.id) + ' ' + str(node.pk))

        diff=pk_diff(nodes,lastnodes)
        if diff < threshold:
            break

if __name__ == '__main__':
    pk_test1()

The result is as follows, after 15 iterations in total:

1 0.107138774577
2 0.35712924859
3 0.214296601128
4 0.321435375705

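These values differ only slightly from the pagerank values Matlab obtained with the power method, so they can be considered correct, which confirms that this mapreduce input/output format works.

OK, that is about it. I hope this helps anyone who needs to understand the PageRank algorithm!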
