Analyze Access Log with Apache Spark

Jun 10, 2017 | Big Data

Apache Spark is a powerful tool for processing large amounts of data. The docs show many examples of analyzing CSV-like data, which is already covered by spark-csv. But how do you analyze more complex data, e.g. an access log file? At first sight it looks quite simple because it is human readable. A closer look, however, shows that there is no simple way to extract the data with split operations. This article shows a different approach to processing text files, which can easily be adapted to any other line-based format.

The Swiss army knife for analyzing textual content is the regular expression. Using capture groups, whole lines of text are matched and split into fragments. A more recent extension, named capture groups, allows a line of text to be split into fields, with each field assigned a name and a value. The following example demonstrates how this works.

64.242.88.10 - - [07/Mar/2004:16:58:54 -0800] "GET /mailman/listinfo/administration HTTP/1.1" 200 6459
The experienced reader will recognize information such as the IP address, the timestamp and the HTTP request details. It is tabular data with columnar information. A simple split on whitespace would break some of these columns, e.g. the date. The request string is also only the first line of an HTTP request and contains whitespace itself. In some cases a request may even contain malicious content, e.g. if an invalid request was sent to the web server.
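To illustrate the problem (this snippet is not part of the original post), a naive whitespace split scatters both the timestamp and the quoted request string across several fields:

// Illustrative only: a naive whitespace split breaks the timestamp and
// the quoted request string into several fragments.
val sample = "64.242.88.10 - - [07/Mar/2004:16:58:54 -0800] \"GET /mailman/listinfo/administration HTTP/1.1\" 200 6459"
sample.split(" ").foreach(println)
// [07/Mar/2004:16:58:54 and -0800] end up in separate fields,
// as do "GET, /mailman/listinfo/administration and HTTP/1.1"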

To analyze the text we build a regex step by step, recognizing the content from left to right (the fragments are combined into a single pattern string after the list below):

(?<host>[0-9]+(\.[0-9]+)+)
\s([A-Za-z-]+)\s([A-Za-z-]+)\s\[
(?<time>[^\]]+)\]\s"
(?<request>[^"]+)"\s
(?<httpStatus>[0-9]+)\s
(?<responseSize>[0-9]+)
  • line 1: extract the IP address into capture group “host”
  • line 2: skip some data
  • line 3: extract the timestamp without the brackets
  • line 4: extract the request content
  • line 5: extract the HTTP status
  • line 6: extract the response size
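Combined into a single Scala string, the pattern used later with SmartMatcher might be defined like this (an assumption, since the post does not show where the pattern variable comes from; a triple-quoted string avoids double escaping of the backslashes):

// Hypothetical definition of the pattern variable used further below.
val pattern = """(?<host>[0-9]+(\.[0-9]+)+)\s([A-Za-z-]+)\s([A-Za-z-]+)\s\[(?<time>[^\]]+)\]\s"(?<request>[^"]+)"\s(?<httpStatus>[0-9]+)\s(?<responseSize>[0-9]+)"""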

The Grok Debugger is a useful tool for creating and testing regular expressions with named capture groups interactively: just edit the pattern or the data and see the result of the extraction. Unfortunately, Java support for named capture groups is limited. Java 8 can evaluate such expressions and retrieve a group by name, but the API offers no way to enumerate the group names defined in a pattern. Scala's Regex is built on the Java classes, so Scala, and therefore the Spark shell, has the same limitation. The good news: a short script works around it.

Spark context Web UI available at http://192.168.178.26:4040
Spark context available as 'sc' (master = local[*], app id = local-1497040518579).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :load "Example.scala"
Loading C:\Users\markus\workspace\MyScala\src\Example.scala...
...

In a first step some helper code is loaded: a Scala class that handles named capture groups.
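The SmartMatcher source is not included in the post. A minimal sketch of such a helper (an assumption about its implementation; the real class additionally assigns placeholder names like d1 and d2 to unnamed groups) could scan the pattern text for the (?<name>…) declarations itself, since the Java 8 API cannot enumerate them:

import java.util.regex.Pattern

// Minimal sketch of a SmartMatcher-like helper (hypothetical; the original
// class is loaded from Example.scala and not shown in the post). It scans
// the pattern text for (?<name>...) declarations because the Java 8 regex
// API cannot enumerate group names on its own. Serializable so it can be
// used inside Spark closures.
class SimpleSmartMatcher(patternText: String) extends Serializable {
  private val compiled = Pattern.compile(patternText)

  // Group names in the order they are declared in the pattern text.
  val fieldNames: List[String] =
    """\(\?<([a-zA-Z][a-zA-Z0-9]*)>""".r
      .findAllMatchIn(patternText).map(_.group(1)).toList

  // Parse one line; a non-matching line yields an empty list.
  def parse(line: String): List[String] = {
    val m = compiled.matcher(line)
    if (m.matches()) fieldNames.map(name => m.group(name)) else Nil
  }

  // Same as parse, but returns (field name -> value) pairs.
  def parseMap(line: String): Map[String, String] =
    fieldNames.zip(parse(line)).toMap
}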

scala> var line = "64.242.88.10 - - [07/Mar/2004:16:47:12 -0800] \"GET /robots.txt HTTP/1.1\" 200 68"
line: String = 64.242.88.10 - - [07/Mar/2004:16:47:12 -0800] "GET /robots.txt HTTP/1.1" 200 68

Then a regex matching the example line is defined and passed to the helper class; in the next step it is applied to the example.

scala> var m = new SmartMatcher( pattern )
fieldNames: List(host, d1, d2, time, request, httpStatus, responseSize)
nameToGroup: Map(request -> 6, host -> 1, responseSize -> 8, d2 -> 4, httpStatus -> 7, time -> 5, d1 -> 3)
m: SmartMatcher = SmartMatcher@315d19f1
scala> m.fieldNames
res1: List[String] = List(host, d1, d2, time, request, httpStatus, responseSize)
scala> m.parse( line )
res2: List[String] = List(64.242.88.10, -, -, 07/Mar/2004:16:47:12 -0800, "GET /robots.txt HTTP/1.1", 200, 68)
scala> m.parseMap( line )
res4: Map[String,String] = Map(request -> "GET /robots.txt HTTP/1.1", host -> 64.242.88.10, responseSize -> 68, d2 -> -, httpStatus -> 200, time -> 07/Mar/2004:16:47:12 -0800, d1 -> -)

An instance of the helper class SmartMatcher is assigned to the variable m. The regex is analyzed and the field names from the named capture groups are computed; the detected fields are returned by m.fieldNames. The example text is then parsed and a list of values is returned. An alternative is parseMap, which returns a map of (field -> value) pairs. These are the basics, so let’s start processing whole files.

scala> var accesslog = spark.read.textFile("c:/tmp/access*")
accesslog: org.apache.spark.sql.Dataset[String] = [value: string]
scala> var myDS = accesslog.map( line => matcher.parse(line) )
myDS: org.apache.spark.sql.Dataset[List[String]] = [value: array<string>]
scala> var myRDD = myDS.rdd
myRDD: org.apache.spark.rdd.RDD[List[String]] = MapPartitionsRDD[10] at rdd at <console>:55
scala> myRDD.take(1)
res13: Array[List[String]] = Array(List(64.242.88.10, -, -, 07/Mar/2004:16:05:49 -0800, "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1", 401, 12846))
  • line 1: read the file contents into a Dataset[String]
  • line 3: map each line through the matcher; the result is a Dataset of List[String] with one entry per column
  • line 5: convert the Dataset to an RDD
  • line 7: show some sample data
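Real log files usually contain a few lines that do not match the pattern, for example truncated or malicious requests as mentioned above. One way to keep such lines from producing rows with the wrong number of columns is to filter them out first (a sketch, assuming parse returns an empty list for non-matching input, as in the helper sketch above):

// Hypothetical guard: drop lines the matcher could not fully parse,
// assuming parse returns an empty list for non-matching input.
var parsedDS = accesslog.map( line => matcher.parse(line) )
                        .filter( fields => fields.nonEmpty )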

For more convenient processing we convert the RDD to a DataFrame. A DataFrame requires a schema definition, so every column is simply mapped to string type.

scala> val fields = matcher.fieldNames.map(field => org.apache.spark.sql.types.StructField(field, org.apache.spark.sql.types.StringType, true))
fields: List[org.apache.spark.sql.types.StructField] = List(StructField(host,StringType,true), StructField(d1,StringType,true), StructField(d2,StringType,true), StructField(time,StringType,true), StructField(request,StringType,true), StructField(httpStatus,StringType,true), StructField(responseSize,StringType,true))
scala> var schema = org.apache.spark.sql.types.StructType(fields)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(host,StringType,true), StructField(d1,StringType,true), StructField(d2,StringType,true), StructField(time,StringType,true), StructField(request,StringType,true), StructField(httpStatus,StringType,true), StructField(responseSize,StringType,true))
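The createDataFrame call below expects an RDD[Row] named rowRDD, which is not shown in the session. One way to derive it from myRDD (a sketch, not part of the original transcript) is to wrap each list of values in a Row:

// Hypothetical bridge step: createDataFrame expects an RDD of Rows,
// so each List[String] produced by the matcher is wrapped in a Row.
import org.apache.spark.sql.Row
var rowRDD = myRDD.map( values => Row.fromSeq(values) )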

A DataFrame is created by combining the RDD with the schema definition:

scala> var myDF = spark.createDataFrame(rowRDD, schema )
myDF: org.apache.spark.sql.DataFrame = [host: string, d1: string ... 5 more fields]
scala> myDF.show
+------------+---+---+--------------------+--------------------+----------+------------+
|        host| d1| d2|                time|             request|httpStatus|responseSize|
+------------+---+---+--------------------+--------------------+----------+------------+
|64.242.88.10|  -|  -|07/Mar/2004:16:05...|"GET /twiki/bin/e...|       401|       12846|
|64.242.88.10|  -|  -|07/Mar/2004:16:06...|"GET /twiki/bin/r...|       200|        4523|
|64.242.88.10|  -|  -|07/Mar/2004:16:10...|"GET /mailman/lis...|       200|        6291|
|64.242.88.10|  -|  -|07/Mar/2004:16:11...|"GET /twiki/bin/v...|       200|        7352|
|64.242.88.10|  -|  -|07/Mar/2004:16:20...|"GET /twiki/bin/v...|       200|        5253|
|64.242.88.10|  -|  -|07/Mar/2004:16:23...|"GET /twiki/bin/o...|       200|       11382|
|64.242.88.10|  -|  -|07/Mar/2004:16:24...|"GET /twiki/bin/v...|       200|        4924|
|64.242.88.10|  -|  -|07/Mar/2004:16:29...|"GET /twiki/bin/e...|       401|       12851|
|64.242.88.10|  -|  -|07/Mar/2004:16:30...|"GET /twiki/bin/a...|       401|       12851|
|64.242.88.10|  -|  -|07/Mar/2004:16:31...|"GET /twiki/bin/v...|       200|        3732|
|64.242.88.10|  -|  -|07/Mar/2004:16:32...|"GET /twiki/bin/v...|       200|       40520|
|64.242.88.10|  -|  -|07/Mar/2004:16:33...|"GET /twiki/bin/e...|       401|       12851|
|64.242.88.10|  -|  -|07/Mar/2004:16:35...|"GET /mailman/lis...|       200|        6379|
|64.242.88.10|  -|  -|07/Mar/2004:16:36...|"GET /twiki/bin/r...|       200|       46373|
|64.242.88.10|  -|  -|07/Mar/2004:16:37...|"GET /twiki/bin/v...|       200|        4140|
|64.242.88.10|  -|  -|07/Mar/2004:16:39...|"GET /twiki/bin/v...|       200|        3853|
|64.242.88.10|  -|  -|07/Mar/2004:16:43...|"GET /twiki/bin/v...|       200|        3686|
|64.242.88.10|  -|  -|07/Mar/2004:16:45...|"GET /twiki/bin/a...|       401|       12846|
|64.242.88.10|  -|  -|07/Mar/2004:16:47...|"GET /robots.txt ...|       200|          68|
|64.242.88.10|  -|  -|07/Mar/2004:16:47...|"GET /twiki/bin/r...|       200|        5724|
+------------+---+---+--------------------+--------------------+----------+------------+
only showing top 20 rows

Finally, a text-based web server access log file has been converted into an Apache Spark DataFrame. By adapting the regex to custom input formats, nearly any text-based, line-oriented file can be processed this way. Building the regex is a process of stepwise testing over several iterations.
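Once the DataFrame exists, the usual Spark operations apply. As a small follow-up example (not part of the original session), the requests can be aggregated by HTTP status code:

// Example follow-up analysis (not in the original transcript):
// count requests per HTTP status code.
myDF.groupBy("httpStatus").count().orderBy("httpStatus").show()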