Analyzing Apache Access Logs with Hadoop + Hive: Combined Format Edition
First, create the table. Note that aggregating combined-format logs will not work unless hive-contrib.jar is added to the classpath, since that jar provides the RegexSerDe used below.
hive> add jar /ebs/hive-0.7.0-bin/lib/hive-contrib-0.7.0.jar;
Added hive-contrib-0.7.0.jar to class path
Added resource: hive-contrib-0.7.0.jar
hive> CREATE TABLE apachelog2 (
    >   host STRING, identity STRING, user STRING, time STRING, request STRING,
    >   status STRING, size STRING, referer STRING, agent STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    > WITH SERDEPROPERTIES (
    >   'input.regex' = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
    >   'output.format.string' = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
    > ) STORED AS TEXTFILE;
OK
Time taken: 11.037 seconds
hive>
Next, load the data.
hive> LOAD DATA LOCAL INPATH '/tmp/access2_log' OVERWRITE INTO TABLE apachelog2;
Copying data from file:/tmp/access2_log
Copying file: file:/tmp/access2_log
Loading data to table default.apachelog2
Deleted hdfs://localhost/user/hive/warehouse/apachelog2
OK
Time taken: 0.884 seconds
hive>
Now aggregate by user agent.
With the logs in their raw format you cannot classify hits by mobile carrier, so you would likely need to reformat the agent field into a separate table and run the aggregation there. The extraction will probably end up being somewhat hard-coded, cutting the User-Agent apart per carrier.
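As a sketch of what that per-carrier cut could look like: the feature-phone User-Agent prefixes of the Japanese carriers are conventional (DoCoMo/, KDDI-, SoftBank/ and its older Vodafone/J-PHONE names), but this classifier and its buckets are illustrative, not the article's actual code:

```python
def carrier_of(agent: str) -> str:
    """Classify a User-Agent string into a Japanese mobile carrier bucket.

    Uses the conventional feature-phone UA prefixes; everything else
    (PCs, smartphones, bots, ...) falls into 'other'.
    """
    ua = agent.strip('"')  # Hive keeps the surrounding quotes from the log
    if ua.startswith('DoCoMo/'):
        return 'docomo'
    if ua.startswith(('KDDI-', 'UP.Browser/')):
        return 'au'
    if ua.startswith(('SoftBank/', 'Vodafone/', 'J-PHONE/')):
        return 'softbank'
    return 'other'

print(carrier_of('"DoCoMo/2.0 P903i(c100;TB;W24H12)"'))
print(carrier_of('"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6)"'))
```

In Hive this kind of prefix match could be expressed with CASE WHEN ... LIKE clauses when populating the reformatted table.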
Since this is only a small sample the result is underwhelming, but the User-Agent and its count (3 hits) printed below the OK line are the output.
hive> SELECT agent, COUNT(1) AS numagent FROM apachelog2 GROUP BY agent SORT BY numagent DESC;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201105040831_0015, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201105040831_0015
Kill Command = /opt/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201105040831_0015
2011-05-04 09:45:12,679 Stage-1 map = 0%, reduce = 0%
2011-05-04 09:45:22,048 Stage-1 map = 100%, reduce = 0%
2011-05-04 09:45:31,110 Stage-1 map = 100%, reduce = 33%
2011-05-04 09:45:34,141 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105040831_0015
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201105040831_0016, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201105040831_0016
Kill Command = /opt/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201105040831_0016
2011-05-04 09:45:46,294 Stage-2 map = 0%, reduce = 0%
2011-05-04 09:45:52,478 Stage-2 map = 100%, reduce = 0%
2011-05-04 09:46:01,582 Stage-2 map = 100%, reduce = 33%
2011-05-04 09:46:04,602 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201105040831_0016
OK
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; ja-jp) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4"	3
Time taken: 63.328 seconds
hive>
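Stripped of the MapReduce machinery, the two-stage job above is just a group-by count followed by a descending sort. For comparison, the same aggregation in plain Python (the rows are hand-built here; in the real flow they would come out of the regex parse):

```python
from collections import Counter

# Hand-built rows standing in for the parsed table; only `agent` matters here.
ua = ('"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; ja-jp) '
      'AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4"')
rows = [{'agent': ua}, {'agent': ua}, {'agent': ua}]

# Equivalent of: SELECT agent, COUNT(1) AS numagent ... GROUP BY agent SORT BY numagent DESC
counts = Counter(row['agent'] for row in rows)
for agent, numagent in counts.most_common():
    print(agent, numagent)
```

Stage 1 of the Hive job does the grouped count; stage 2 handles the sort on `numagent`, which is why two MapReduce jobs are launched.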