Analyzing Apache Access Logs with Hadoop + Hive

To set up Hive, on a machine where Hadoop is already running correctly, just extract the archive (hive-0.7.0-bin.tar.gz) to a suitable location, set HADOOP_HOME, add Hive's bin directory to PATH, and run the hive command.
The Hive console starts.
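
For reference, a minimal sketch of those steps; /opt/hadoop matches the paths in the job output below, while the Hive location is just an example:

$ tar xzf hive-0.7.0-bin.tar.gz -C /opt
$ export HADOOP_HOME=/opt/hadoop
$ export PATH=$PATH:/opt/hive-0.7.0-bin/bin
$ hive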

First, create a table for the Apache logs. For now the log format is common; combined comes later.

hive> CREATE TABLE apachelog (
    > ipaddress STRING, identd STRING, user STRING, finishtime STRING,
    > requestline STRING, returncode INT, size INT)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
    > WITH SERDEPROPERTIES (
    > 'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
    > 'quote.delim'='("|\\[|\\])',
    > 'field.delim'=' ',
    > 'serialization.null.format'='-')
    > STORED AS TEXTFILE;
OK
Time taken: 0.233 seconds
hive> 
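
For reference, here is the sample common-format line from the Apache documentation. The space is the field delimiter, but as I read the SerDe properties, the quote.delim pattern ("|\[|\]) treats double quotes and square brackets as quoting characters, so the bracketed timestamp and the quoted request line each survive as a single field, and serialization.null.format='-' maps the bare hyphens to NULL:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326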

Next, load the log file.

hive> LOAD DATA LOCAL INPATH '/tmp/access_log' OVERWRITE INTO TABLE apachelog;            
Copying data from file:/tmp/access_log
Copying file: file:/tmp/access_log
Loading data to table default.apachelog
Deleted hdfs://localhost/user/hive/warehouse/apachelog
OK
Time taken: 0.307 seconds
hive> 
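
Before aggregating, it is worth a quick check that the SerDe split the fields as intended; something like the following (output depends on your log, so it is omitted here):

hive> SELECT ipaddress, finishtime, requestline, returncode FROM apachelog LIMIT 3;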

The data is now loaded, so we can aggregate it with HiveQL.
The query below groups by IP address and lists the counts in descending order.
It apparently runs as three MapReduce jobs.

hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201105010410_0010, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201105010410_0010
Kill Command = /opt/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201105010410_0010
2011-05-01 06:17:50,384 Stage-1 map = 0%,  reduce = 0%
2011-05-01 06:17:59,452 Stage-1 map = 100%,  reduce = 0%
2011-05-01 06:18:08,493 Stage-1 map = 100%,  reduce = 33%
2011-05-01 06:18:11,520 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201105010410_0010
Launching Job 2 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201105010410_0011, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201105010410_0011
Kill Command = /opt/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201105010410_0011
2011-05-01 06:18:23,163 Stage-2 map = 0%,  reduce = 0%
2011-05-01 06:18:29,191 Stage-2 map = 100%,  reduce = 0%
2011-05-01 06:18:38,251 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_201105010410_0011
Launching Job 3 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201105010410_0012, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201105010410_0012
Kill Command = /opt/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201105010410_0012
2011-05-01 06:18:51,100 Stage-3 map = 0%,  reduce = 0%
2011-05-01 06:18:57,114 Stage-3 map = 100%,  reduce = 0%
2011-05-01 06:19:06,143 Stage-3 map = 100%,  reduce = 100%
Ended Job = job_201105010410_0012
OK
192.168.1.100	71
127.0.0.1	2
Time taken: 59.31 seconds
hive>
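
The two expected addresses come back in descending order of request count. One caveat: SORT BY only guarantees ordering within each reducer, so the result above is totally ordered only because a single reducer ran. With multiple reducers, ORDER BY gives a globally sorted result instead, for example:

hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress ORDER BY numrequest DESC LIMIT 10;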