Extracting HTML Data with Shell, Part 2

Posted by yuntui on 2016-11-03
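Yesterday, while using a shell script to extract HTML data, I ran into a problem. When the data to be extracted looks like the sample below, the extraction becomes inconsistent: some records have no data and show only "未開售" (not yet on sale). Extracting with the original approach then scrambles the data. For example, extracting by the first column yields 75 rows, but extracting the odds on the right yields only 74. Once a single row is misaligned, every row after it is wrong.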

+1 
5.80 4.40 1.38
  2.58 3.55 2.18
2
未開售
  1.55 4.30 4.00

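When both the FMW and FHMW divs have data, each contains three lines of values. When the FMW div shows only "未開售", it contains a single line.
The two lists therefore no longer correspond row for row.
We need to find a pattern that lets us extract and filter conditionally.
The HTML looks roughly like this: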

                                                         <div class="selection">
                                                             <div class="FMW">
                                                                 <a class="homewin btn" op="w"><span class="num">6.00<span class="hoverArea" oncontextmenu="return false;"><cite></cite></a>
                                                                 <a class="draw btn" op="d"><span class="num">4.30<span class="hoverArea" oncontextmenu="return false;"><cite></cite></a>
                                                                 <a class="awaywin btn" op="l"><span class="num">1.38<span class="hoverArea" oncontextmenu="return false;"><cite></cite></a>
                                                            

                                                             <div class="FHMW">
                                                                 <a class="homewin btn" op="hdw"><span class="num">2.55<span class="hoverArea" oncontextmenu="return false;"><cite></cite></a>
                                                                     <a class="draw btn" op="hdd"><span class="num">3.50<span class="hoverArea" oncontextmenu="return false;"><cite></cite></a>
                                                                     <a class="awaywin btn" op="hdl"><span class="num">2.22<span class="hoverArea" oncontextmenu="return false;"><cite></cite></a>
                                                            

 
                                                      <span class="handicap td">
                                                             <em class="num nh">0</em>
                                                             <em class="num h"><b class="w0 td1 num h-pt">-2</b></em>
                                                        
                                                         <div class="odds-area td no-select">
                                                         <div class="selection">
                                                             <div class="FMW">
                                                                 <em class="no-sale">未開售</em>
                                                                                                                                 
                                                            

                                                             <div class="FHMW">
                                                                 <a class="homewin btn" op="hdw"><span class="num">1.53<span class="hoverArea" oncontextmenu="return false;"><cite></cite></a>
                                                                     <a class="draw btn" op="hdd"><span class="num">4.45<span class="hoverArea" oncontextmenu="return false;"><cite></cite></a>
                                                                     <a class="awaywin btn" op="hdl"><span class="num">4.00<span class="hoverArea" oncontextmenu="return false;"><cite></cite></a>
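
Before changing the extraction, a quick count can confirm where the mismatch comes from. The following is only a minimal sketch on a trimmed-down sample: the file name sample.html and the shortened <a> lines are assumptions for illustration, not the real page. It compares the number of odds blocks with the number of odds values and of no-sale placeholders.

```shell
# Hypothetical, trimmed sample modeled on the HTML above (two matches).
tmp=$(mktemp -d)
cat > "$tmp/sample.html" <<'EOF'
<div class="FMW">
<a class="homewin btn" op="w"><span class="num">6.00</span></a>
<a class="draw btn" op="d"><span class="num">4.30</span></a>
<a class="awaywin btn" op="l"><span class="num">1.38</span></a>
<div class="FHMW">
<a class="homewin btn" op="hdw"><span class="num">2.55</span></a>
<a class="draw btn" op="hdd"><span class="num">3.50</span></a>
<a class="awaywin btn" op="hdl"><span class="num">2.22</span></a>
<div class="FMW">
<em class="no-sale">未開售</em>
<div class="FHMW">
<a class="homewin btn" op="hdw"><span class="num">1.53</span></a>
<a class="draw btn" op="hdd"><span class="num">4.45</span></a>
<a class="awaywin btn" op="hdl"><span class="num">4.00</span></a>
EOF

# Count odds blocks, odds values, and "not on sale" placeholders.
blocks=$(grep -cE 'class="(FMW|FHMW)"' "$tmp/sample.html")
odds=$(grep -c 'class="num">' "$tmp/sample.html")
unsold=$(grep -c 'class="no-sale"' "$tmp/sample.html")

# If every block were complete there would be 3 odds values per block.
missing=$(( blocks * 3 - odds ))
echo "blocks=$blocks odds=$odds unsold=$unsold missing=$missing"
rm -rf "$tmp"
```

If missing equals three times unsold, the gap is fully explained by the placeholders, and padding each of them with three values should restore the alignment.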
                                                            
    


Now for an improvement: extract the data div by div.

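# vs.lst: the text that follows score-text"> on each line containing em class="vs"
grep "em class=\"vs\"" *e|awk -F"score-text\">" '{print $2}'|awk -F"<" '{print $1}' > vs.lst
# fmw.lst: take each FMW block; a no-sale line is replaced by three zero entries so that every block yields three values, which are then joined onto one line
grep  -A4 "div class=\"selection\"" *e|grep -A3 "FMW" |awk '{ if($2~/no-sale/) {print "\"num\">0< \n \"num\">0< \n \"num\">0< \n" } else {print $5$6}}'|awk -F"num\">" '{print $2}'|awk -F"<" '{print $1}' |awk -v RS= '{print $1" " $2" " $3}' > fmw.lst
# hfmw.lst: the same pipeline with a wider context window (-A9) so that it reaches the FHMW block
grep  -A9 "div class=\"selection\"" *e|grep -A3 "FHMW" |awk '{ if($2~/no-sale/) {print "\"num\">0< \n \"num\">0< \n \"num\">0< \n" } else {print $5$6}}'|awk -F"num\">" '{print $2}'|awk -F"<" '{print $1}' |awk -v RS= '{print $1" " $2" " $3}' > hfmw.lst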
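
The same idea (track which div we are in, and pad a no-sale block with zeros) can also be written as a single awk pass that does not depend on fixed -A context sizes. This is only a sketch of an alternative, not the commands used above. The function name extract_odds, the file name sample.html, and the shortened <a> lines are assumptions for illustration.

```shell
# Hypothetical, trimmed sample modeled on the HTML above (two matches).
tmp=$(mktemp -d)
cat > "$tmp/sample.html" <<'EOF'
<div class="selection">
<div class="FMW">
<a class="homewin btn" op="w"><span class="num">6.00<span class="hoverArea"></a>
<a class="draw btn" op="d"><span class="num">4.30<span class="hoverArea"></a>
<a class="awaywin btn" op="l"><span class="num">1.38<span class="hoverArea"></a>
<div class="FHMW">
<a class="homewin btn" op="hdw"><span class="num">2.55<span class="hoverArea"></a>
<a class="draw btn" op="hdd"><span class="num">3.50<span class="hoverArea"></a>
<a class="awaywin btn" op="hdl"><span class="num">2.22<span class="hoverArea"></a>
<div class="selection">
<div class="FMW">
<em class="no-sale">未開售</em>
<div class="FHMW">
<a class="homewin btn" op="hdw"><span class="num">1.53<span class="hoverArea"></a>
<a class="draw btn" op="hdd"><span class="num">4.45<span class="hoverArea"></a>
<a class="awaywin btn" op="hdl"><span class="num">4.00<span class="hoverArea"></a>
EOF

# extract_odds CLASS FILE: print one line of odds per <div class="CLASS"> block.
extract_odds() {
    awk -v cls="$1" '
        # Emit the row collected for the block that just ended, if any.
        function flush() { if (inside && row != "") print row; row = "" }

        # Any opening div ends the previous block and may start a target block.
        /<div class="/ {
            flush()
            inside = (index($0, "class=\"" cls "\"") > 0)
            next
        }
        # A "not on sale" placeholder stands in for three zero values.
        inside && /no-sale/ { row = "0 0 0"; next }

        # Otherwise pick the number that follows class="num"> (12 characters).
        inside && match($0, /class="num">[0-9.]+/) {
            val = substr($0, RSTART + 12, RLENGTH - 12)
            row = (row == "" ? val : row " " val)
        }
        END { flush() }
    ' "$2"
}

fmw_rows=$(extract_odds FMW "$tmp/sample.html")
fhmw_rows=$(extract_odds FHMW "$tmp/sample.html")
printf 'FMW:\n%s\nFHMW:\n%s\n' "$fmw_rows" "$fhmw_rows"
rm -rf "$tmp"
```

Because a block is closed by the next opening div rather than by a line count, a block with one line and a block with three lines both produce exactly one output row.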

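After extraction the data lines up naturally; records marked "未開售" are set to 0.
Once extracted and filtered, the result looks like the following, and the rows are no longer scrambled.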
1,3.40,3.35,1.88,1.71,3.65,3.70
2,0,0,0,2.85,4.20,1.85
1,4.90,3.95,1.50,2.24,3.60,2.47
1,7.10,4.80,1.29,2.95,3.75,1.91
1,5.30,3.85,1.48,2.26,3.35,2.58
1,5.00,4.00,1.49,2.25,3.55,2.48
1,3.20,3.40,1.93,1.68,3.75,3.75
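
The post does not show how the separate lists are combined into the comma-separated rows above. One plausible way, given here only as a sketch, is paste followed by tr, with a line-count check first so that a misaligned list is caught before merging. The file first.lst (holding the first column) is a hypothetical name; the sample values are the first two rows of the result above.

```shell
# Small stand-in lists; first.lst is a hypothetical name for the first column.
tmp=$(mktemp -d)
printf '%s\n' '1' '2' > "$tmp/first.lst"
printf '%s\n' '3.40 3.35 1.88' '0 0 0' > "$tmp/fmw.lst"
printf '%s\n' '1.71 3.65 3.70' '2.85 4.20 1.85' > "$tmp/hfmw.lst"

# Number of distinct line counts across the lists; 1 means they all agree.
distinct=$(for f in "$tmp/first.lst" "$tmp/fmw.lst" "$tmp/hfmw.lst"; do
    wc -l < "$f" | tr -d ' '
done | sort -u | wc -l | tr -d ' ')

if [ "$distinct" -eq 1 ]; then
    # Join the lists column-wise, then turn the inner spaces into commas.
    merged=$(paste -d, "$tmp/first.lst" "$tmp/fmw.lst" "$tmp/hfmw.lst" | tr ' ' ',')
    printf '%s\n' "$merged"
else
    echo "list lengths differ; refusing to merge" >&2
fi
rm -rf "$tmp"
```

The length check is the cheap guard against the original problem: if one list is a row short, paste would still produce output, just with every later row shifted.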

From the ITPUB blog, link: http://blog.itpub.net/30633755/viewspace-2127770/. Please credit the source when reposting; otherwise legal liability will be pursued.
