apache日誌匯入oracle(日誌經過python處理)

perfychi發表於2012-11-19

1、apache日誌格式:
    LogFormat "%h %l %u %t \"%r\" %>s %b" common(引用httpd.conf配置檔案)
注視:每個位置代表什麼含義網上都有,可以自己看。

2、apache日誌內容:(第一行空格已經標出,其他行格式一樣,不標了)
183.31.43.112 空格 -空格 -空格 [06/Nov/2012:00:00:17 空格+0800] 空格"GET空格 /hz HTTP/1.1" 空格302空格 -
211.103.82.175 - - [06/Nov/2012:00:00:24 +0800] "GET /hz HTTP/1.1" 302 -
221.179.37.12 - - [06/Nov/2012:00:00:33 +0800] "GET /hz?t=35245 HTTP/1.0" 302 -
211.139.145.101 - - [06/Nov/2012:00:00:34 +0800] "GET /hz HTTP/1.1" 302 -
120.196.208.79 - - [06/Nov/2012:00:00:38 +0800] "GET /hz HTTP/1.1" 302 -
117.136.12.210 - - [06/Nov/2012:00:00:38 +0800] "GET /hz HTTP/1.1" 302 -
221.179.85.10 - - [06/Nov/2012:00:00:40 +0800] "GET /hz?t=43385 HTTP/1.0" 302 -
119.139.89.179 - - [06/Nov/2012:00:00:46 +0800] "GET /hz HTTP/1.1" 302 -
。。。。。。。。。
。。。。。。。。。
。。。。。。。。。


3、試驗要求:
把apache日誌中每行的 “遠端主機IP/主機名”,“訪問時間”,“url”,“ 狀態”匯入到oracle資料庫中。


4、oracle資料庫建立的相關表及檢視:
(1) create table APACHE_LOG
(
  remote_host CHAR(15),
  access_time VARCHAR2(30),
  full_url    VARCHAR2(200),
  status      CHAR(10)
);
(2)create table APACHE_LOG_RPT
(
  remote_host CHAR(15),
  access_time DATE,
  real_url    VARCHAR2(392),
  full_url    VARCHAR2(200),
  status      CHAR(10)
);
(3)create or replace view v_apache_log_vi as
select REMOTE_HOST  , trunc(to_date(ACCESS_TIME ,'dd/mon/yyyy hh24:mi:ss')) access_time ,
case when instr(full_url, '?') > 0 then
  substr(full_url, 5, instr(full_url, '?')-5 )
  else
      substr(full_url, 5, instr(full_url, ' ',5) -5 )
  end real_url
 , FULL_URL , STATUS  from apache_log;

5、使用python格式化apache日誌:

(1)python指令碼1:processor.py

#!/bin/env python
import  sys, re

class Processor:
    def __init__(self, reader, writer):
        self.reader =  open(reader, 'r')
        self.writer =  open(writer, 'w')

    def convert(self, data):
        assert 0, 'convert needed to be defined'

    def process(self):
        while 1:
            line = self.reader.readline()
            if not line:  break
            data = self.convert(line)
            self.writer.write(data)


class UpperCase(Processor):
    def convert(self, data):
        return data.upper()

class HTMLize:
    def write(self, data):
        print  '%s' % data[:-1]

class ScanApacheLog(Processor):
    def convert(self, data):
        p = re.compile(r'''
        (?P\S+) #remote_host
        \s+ #whitespace
        \S+ #remote_login
        \s+ #whitespace
        \S+ #remote_user
        \s+ #whitespace
        (?P
        \s+ #whitespace
        (?P".*") #first line of request
        \s+ #whitespace
        (?P\d+) #status
        \s+ #whitespace
        (-|\d+) # send_size
        \s*
        ''', re.X)
        m = p.match(data)
        if m is not None:
            arr = m.groupdict()
            request = re.match( r'"(.*)"', arr['request']).group(1)
            access_time = re.match(r'\[(\S+)\s+\S+\]', arr['time']).group(1)
            line = arr['remote_host'] + ','  + access_time + ',' +  request + ',' +  arr['status'] + '\n'
            return line

if  __name__ == '__main__':
    UpperCase(open('1.txt'), sys.stdout).process()
    UpperCase(open('1.txt'), open('2.txt', 'w')).process()
    UpperCase(open('1.txt'), HTMLize()).process()


(2)python指令碼2:scan.py
#!/bin/env python

from  processor import * 

if __name__ == '__main__':
    reader='/home/oracle/v1x-access_log_20121106'
    writer='/tmp/dealed.log'
    ScanApacheLog(reader, writer ).process()    

(3)把上面兩個指令碼放在同一個目錄下,執行#python scan.py 即可
    reader='/home/oracle/v1x-access_log_20121106'記錄的是apache日誌位置;
       writer='/tmp/dealed.log' 記錄的是經過python指令碼格式化處理的檔案,也是下面步驟要使用的檔案。

/tmp/dealed.log'部分內容如下:
112.95.194.245,06/Nov/2012:00:00:01,GET /hz HTTP/1.1,302
120.196.208.72,06/Nov/2012:00:00:07,GET /hz HTTP/1.1,302
113.74.6.56,06/Nov/2012:00:00:09,GET /hz HTTP/1.1,302
183.31.43.112,06/Nov/2012:00:00:17,GET /hz HTTP/1.1,302
211.103.82.175,06/Nov/2012:00:00:24,GET /hz HTTP/1.1,302
221.179.37.12,06/Nov/2012:00:00:33,GET /hz?t=35245 HTTP/1.0,302
211.139.145.101,06/Nov/2012:00:00:34,GET /hz HTTP/1.1,302
120.196.208.79,06/Nov/2012:00:00:38,GET /hz HTTP/1.1,302
。。。。。。。
。。。。。。。

   
6、利用 sqlldr工具把'/tmp/dealed.log匯入到oracle資料庫的apache_log表
首先編寫控制檔案,控制檔案很重要
[oracle@shanxi myself]$ more apachelog2.ctl 
load data
truncate into table  apache_log
fields terminated by ',' 
trailing nullcols
(
REMOTE_HOST,
ACCESS_TIME ,
FULL_URL ,
STATUS 
)

,然後把/tmp/dealed.log匯入資料庫:

[oracle@shanxi myself]$ sqlldr scott/###  control=apachelog.ctl  data=/tmp/dealed.log

7、最後一步,使用前面建立的封裝處理邏輯的檢視匯入到apache_log_rpt表
SQL> alter session set nls_date_language='AMERICAN';
SQL > insert into apache_log_rpt select * from v_apachelog_vi;
因為涉及到日誌格式轉換,所以需要修改session的nls_date_language引數( 主要是“select REMOTE_HOST  , trunc(to_date(ACCESS_TIME ,'dd/mon/yyyy hh24:mi:ss')) access_time”的mon部分是那種格式,與nls_date_language是CHINEASE還是AMERICAN有關係),否則如果mon格式跟nls_date_language不一致的話,查詢檢視v_apache_log_vi的時候會報錯的。就是因為這個原因導致的。
   


這個處理邏輯就是上面這些。下面列出我之前接觸過的這方面的指令碼,功能是類似的:

試驗要求是使用crond程式呼叫shell指令碼,每天定時自動匯入昨天的apache日誌(v1x-access_log_***)到資料庫中。
目錄結構如下:
[oracle@searchdb2 apache]$pwd
/logbackup/apache
[oracle@searchdb2 apache]$ll
total 88
drwxr-xr-x 2 yangzhanli dev  9216 Oct  1 08:50 201209
drwxr-xr-x 2 yangzhanli dev 16384 Nov  1 08:50 201210
drwxr-xr-x 2 yangzhanli dev 10240 Nov 19 08:50 201211
drwxr-xr-x 2 yangzhanli dev 38912 Nov 19 08:50 md5sumfile
[oracle@searchdb2 apache]$cd 

[oracle@searchdb2 apache]$cd 201211/
[oracle@searchdb2 201211]$ls -l | grep v1x | more
-rw-r--r-- 1 yangzhanli dev  820634 Nov  1 23:59 v1x-access_log_20121101
-rw-r--r-- 1 yangzhanli dev  658913 Nov  2 23:59 v1x-access_log_20121102
-rw-r--r-- 1 yangzhanli dev 1066549 Nov  3 23:59 v1x-access_log_20121103
-rw-r--r-- 1 yangzhanli dev  644101 Nov  4 23:59 v1x-access_log_20121104
-rw-r--r-- 1 yangzhanli dev  888134 Nov  5 23:59 v1x-access_log_20121105
-rw-r--r-- 1 yangzhanli dev 1336714 Nov  6 23:59 v1x-access_log_20121106
-rw-r--r-- 1 yangzhanli dev  596688 Nov  7 23:59 v1x-access_log_20121107
-rw-r--r-- 1 yangzhanli dev  497143 Nov  8 23:59 v1x-access_log_20121108

cron呼叫如下:
[oracle@searchdb2 201211]$ crontab -l
0 7 * * * /home/oracle/admin/tarsmsglogs/bin/tar_smslogs.sh cm
0 8 * * * /home/oracle/admin/tarsmsglogs/bin/tar_smslogs.sh ct
0 6 * 1 *  /home/oracle/insert_log/insert_log.sh
4 9 * * *  /home/oracle/admin/apache/import_apachelog2.sh > /dev/null 2>&1

shell指令碼import_apachelog2.sh內容如下:

!/bin/bash
# Filename: import_apachelog2.sh
# Date: 2012-10-31
# Desc: import the logfile '/logbackup/apache/${month}/ v1x-access_log_**.log' into the table  'apache_log_rpt' in SPICEMOM schema
# Usage:  /home/oracle/admin/apache/import_apachelog2.sh
# Version: 1.1
. /home/oracle/.bash_profile

LOG_ROOT=/logbackup/apache
CUR_MONTH=`date -d '1 day ago' +%Y%m`
YESTERDAY=`date -d '1 day ago' +%Y%m%d`
LOG_PREFIX=v1x-access_log
LOG_FILE=${LOG_ROOT}/${CUR_MONTH}/${LOG_PREFIX}_${YESTERDAY}
SCRIPT_PATH=/home/oracle/admin/apache
#echo $LOG_FILE
#echo  ${SCRIPT_PATH}
#exit

cd  ${SCRIPT_PATH}
if [ ! -f $LOG_FILE ]; then
        now=`date '+ %Y-%m-%d %H:%M:%S'`
        echo "* Failed at $now:the logfile ${LOG_FILE} needed to import  doesn't exist !" >> import_history.log
        exit
fi

cp ${LOG_FILE} /tmp/v1x.log

python scanlog.py

sqlldr spicemom/spicemom control=apachelog.ctl  data="/tmp/scanlog.txt"

if [ "$?" -eq 0 ] ; then
        now=`date '+ %Y-%m-%d %H:%M:%S'`
        sqlplus spicemom/spicemom @append_log.sql
        echo "At $now:the logfile ${LOG_FILE} succeeds to  import  the table apache_log_rpt !" >> import_history.log
        rm -rf /tmp/v1x.log
        rm -rf /tmp/scanlog.txt
fi


所有涉及到的指令碼目錄結構如下:
[oracle@searchdb2 201211]$cd /home/oracle/admin/apache
[oracle@searchdb2 apache]$ll
total 36
-rw-r--r-- 1 oracle dba   69 Sep 12 15:29 afiedt.buf
-rw-r--r-- 1 oracle dba  383 Oct 30 13:38 apachelog.ctl
-rw-r--r-- 1 oracle dba 1750 Nov 19 09:04 apachelog.log
-rw-r--r-- 1 oracle dba  135 Oct 29 20:29 append_log.sql
-rwxr--r-- 1 oracle dba  968 Nov  4 12:19 import_apachelog.sh
-rwxr--r-- 1 oracle dba 1154 Nov  4 12:19 import_apachelog2.sh
-rw-r--r-- 1 oracle dba 2788 Nov 19 09:04 import_history.log
drwxr-xr-x 2 oracle dba 4096 Oct 30 16:00 orig_source
-rw-r--r-- 1 oracle dba  938 Oct 30 13:35 scanlog.py


再看一下指令碼內容:
python的指令碼得注意格式要符合程式結構,否則的話可能執行不了。python的程式碼是不是一段程式碼塊不是靠類似C語言的中括號{},而是靠格式對齊來判斷是不是同一段程式碼塊, 這點感覺挺噁心的。(如果有人copy可得注意了)
[oracle@searchdb2 apache]$  vi scanlog.py 

#!/usr/bin/env python
import sys
import re

def ScanOneLog(sLogFile,dLogFile):
    log_line_re = re.compile(r'''(?P\S+) #IP ADDRESS
        \s+ #whitespace
        \S+ #remote logname
        \s+ #whitespace
        \S+ #remote user
        \s+ #whitespace
        (?P
        \s+ #whitespace
        (?P"[^"]+") #first line of request
        \s+ #whitespace
        (?P\d+)
        \s+ #whitespace
        (?P-|\d+)
        \s* #whitespace
        ''', re.VERBOSE)
    lf=open(dLogFile,'w')
    for line in open(sLogFile,'r'):
        m = log_line_re.match(line)
        if m is not None :
            groupdict = m.groupdict()
            tmps=groupdict['remote_host']+','+groupdict['status']+','+groupdict['time']+','+groupdict['request']+'\n'
            lf.writelines(tmps)
    lf.close()

if __name__=='__main__':
    ScanOneLog('/tmp/v1x.log','/tmp/scanlog.txt')

[oracle@searchdb2 apache]$vi apachelog.ctl 
load data
truncate into table  apache_log
FIELDS TERMINATED BY "," TRAILING NULLCOLS
(
remote_host char,
status char,
access_time char,
url char(4000)
)


[oracle@searchdb2 apache]$vi append_log.sql 
alter   session   set   nls_date_language   = 'AMERICAN';
insert into apache_log_rpt select * from v_apache_log_ori;
commit;
exit



最後給列下表和檢視結構定義:
create table APACHE_LOG
(
  remote_host VARCHAR2(20),
  url         VARCHAR2(4000),
  status      VARCHAR2(10),
  access_time VARCHAR2(30)
)

create table APACHE_LOG_RPT
(
  remote_host VARCHAR2(20),
  real_url    VARCHAR2(4000),
  url         VARCHAR2(4000),
  status      VARCHAR2(10),
  access_time DATE
)

create or replace view v_apache_log as
select "REMOTE_HOST","REAL_URL","URL","STATUS","ACCESS_TIME" from apache_log_rpt;

create or replace view v_apache_log_ori as
select remote_host,
case
       when instr(url,'?') >0 then substr(url,6,instr(url,'?')-6)
       else substr(url,6,instr(url,' ',6)-6)
       end real_url,
url,status,
  trunc(to_date(SUBSTR(ACCESS_TIME,2,LENGTH(ACCESS_TIME)-8),'dd/mon/yyyy:hh24:mi:ss') )  access_time from apache_log;

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/27042095/viewspace-749496/,如需轉載,請註明出處,否則將追究法律責任。

相關文章