Analyzing Apache access logs with Hadoop + Hive
To set up Hive, on a machine where Hadoop already runs correctly, extract the archive (hive-0.7.0-bin.tar.gz) to a suitable location. Set HADOOP_HOME, add Hive's bin directory to your PATH, and then simply run the hive command.
The Hive console starts.
First, create a table for the Apache logs. For now the log format is common; combined comes later.
hive> CREATE TABLE apachelog (
    >   ipaddress STRING, identd STRING, user STRING, finishtime STRING,
    >   requestline STRING, returncode INT, size INT)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
    > WITH SERDEPROPERTIES (
    >   'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
    >   'quote.delim'='("|\\[|\\])',
    >   'field.delim'=' ',
    >   'serialization.null.format'='-')
    > STORED AS TEXTFILE;
OK
Time taken: 0.233 seconds
hive>
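To see roughly what those SERDEPROPERTIES are doing, here is a small Python sketch (not Hive code, just an illustration): fields are split on spaces per field.delim, the quote.delim pattern ("|\[|\]) keeps quoted and bracketed runs together as a single field, and a bare - becomes NULL per serialization.null.format. The regex and helper name below are my own approximation of how TCTLSeparatedProtocol tokenizes a line, not its actual implementation.

```python
import re

# Approximation of the SerDe settings above:
#   field.delim = ' ', quote.delim = ("|\[|\]), serialization.null.format = '-'
TOKEN = re.compile(r'\[([^\]]*)\]'    # bracketed: [10/Oct/2000:13:55:36 -0700]
                   r'|"([^"]*)"'      # quoted:    "GET /apache_pb.gif HTTP/1.0"
                   r'|(\S+)')         # plain space-separated field

def parse_common(line):
    """Split one common-format log line into the 7 columns of the table above."""
    fields = []
    for m in TOKEN.finditer(line):
        value = next(g for g in m.groups() if g is not None)
        fields.append(None if value == '-' else value)  # '-' -> NULL
    return fields

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(parse_common(line))
# -> ipaddress, identd, user, finishtime, requestline, returncode, size
```

The seven values line up one-to-one with the columns declared in the CREATE TABLE.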
Next, load the log file.
hive> LOAD DATA LOCAL INPATH '/tmp/access_log' OVERWRITE INTO TABLE apachelog;
Copying data from file:/tmp/access_log
Copying file: file:/tmp/access_log
Loading data to table default.apachelog
Deleted hdfs://localhost/user/hive/warehouse/apachelog
OK
Time taken: 0.307 seconds
hive>
Now that the data is loaded, we can run aggregations in HiveQL.
Group by IP address and show the counts in descending order.
This query apparently runs three MapReduce jobs.
hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC;
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201105010410_0010, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201105010410_0010
Kill Command = /opt/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201105010410_0010
2011-05-01 06:17:50,384 Stage-1 map = 0%, reduce = 0%
2011-05-01 06:17:59,452 Stage-1 map = 100%, reduce = 0%
2011-05-01 06:18:08,493 Stage-1 map = 100%, reduce = 33%
2011-05-01 06:18:11,520 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105010410_0010
Launching Job 2 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201105010410_0011, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201105010410_0011
Kill Command = /opt/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201105010410_0011
2011-05-01 06:18:23,163 Stage-2 map = 0%, reduce = 0%
2011-05-01 06:18:29,191 Stage-2 map = 100%, reduce = 0%
2011-05-01 06:18:38,251 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201105010410_0011
Launching Job 3 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201105010410_0012, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201105010410_0012
Kill Command = /opt/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201105010410_0012
2011-05-01 06:18:51,100 Stage-3 map = 0%, reduce = 0%
2011-05-01 06:18:57,114 Stage-3 map = 100%, reduce = 0%
2011-05-01 06:19:06,143 Stage-3 map = 100%, reduce = 100%
Ended Job = job_201105010410_0012
OK
192.168.1.100	71
127.0.0.1	2
Time taken: 59.31 seconds
hive>
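Incidentally, what Hive distributes across those jobs is the classic count-and-sort pattern. As a sanity check, here is a local Python sketch of the same GROUP BY / SORT BY over raw log lines (the helper name and sample lines are mine, not from the post):

```python
from collections import Counter

def count_by_ip(lines):
    """GROUP BY ipaddress, ordered by count descending (cf. SORT BY numrequest DESC)."""
    return Counter(line.split(' ', 1)[0] for line in lines).most_common()

sample = [
    '192.168.1.100 - - [01/May/2011:06:00:00 +0900] "GET / HTTP/1.1" 200 512',
    '192.168.1.100 - - [01/May/2011:06:00:01 +0900] "GET /a HTTP/1.1" 200 128',
    '127.0.0.1 - - [01/May/2011:06:00:02 +0900] "GET /b HTTP/1.1" 404 0',
]
for ip, numrequest in count_by_ip(sample):
    print(ip, numrequest)   # most frequent IP first
```

The difference, of course, is that Hive runs this as MapReduce over HDFS, so the same query keeps working when the log no longer fits on one machine.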