Make it easy: Apache Spark, Data Frames and Regex Power

by | Aug 5, 2020 | Big Data

Regular expressions are a powerful tool for splitting text into fragments, and Apache Spark is an analytics engine capable of processing large data sets. Named capturing groups make regular expressions more accessible. Unfortunately, support for them in Java and Scala is limited, so the named groups of a regular expression are not directly available. Fortunately, there is an easy-to-use approach.

Regular Expressions and Named Capturing Groups

A regular expression matches a text or parts of it. Capturing groups split the regular expression into one or more fragments, and each matched piece is accessible through its group. The full regular expression defines group number 0, and every opening parenthesis defines a further group, so groups are numbered in the order of their opening parentheses. Because editing the regex may change this numbering, numbered groups are not intuitive for humans.

Regular Expression

(foo)([bar]+)
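
To make the numbering concrete, here is a minimal sketch in plain Java using `java.util.regex`. The class name `NumberedGroups` and the second regex with the extra leading group are illustrative assumptions, not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberedGroups {
    // Returns group 0 (the full match) followed by every numbered group.
    // The list is empty when the input does not match the regex.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Group 0 is the whole match; groups 1 and 2 follow the opening parentheses.
        System.out.println(groups("(foo)([bar]+)", "foobar"));
        // An extra group in front shifts the numbers of all later groups.
        System.out.println(groups("(x?)(foo)([bar]+)", "foobar"));
    }
}
```

In the second call, `foo` moves from group 1 to group 2, which is exactly the kind of renumbering that makes numbered groups fragile.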

Named capturing groups are more human-readable.

Named groups use a small syntax extension that associates a group with a name. For this reason, editing the regex does not affect a group’s name. The label is sticky! A regex may also mix named and unnamed groups in any order.

Regular Expression with Named Capturing Groups

(?<key1>foo)(?<key2>[bar]+)
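
As a small sketch of how this looks in plain Java, the regex above can be queried by group name through `Matcher.group(String)`. The class name `NamedGroups` and the helper `lookup` are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroups {
    static final Pattern PATTERN = Pattern.compile("(?<key1>foo)(?<key2>[bar]+)");

    // Returns the content of the named group, or null if the input does not match.
    static String lookup(String input, String groupName) {
        Matcher m = PATTERN.matcher(input);
        return m.matches() ? m.group(groupName) : null;
    }

    public static void main(String[] args) {
        System.out.println(lookup("foobar", "key1"));
        System.out.println(lookup("foobar", "key2"));
    }
}
```

Note that Java restricts group names to letters and digits, starting with a letter.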

How do named capturing groups help?

In the case of a match, each group name and its matching content form a key-value pair. In other words, when the regex matches, we can build key-value pairs from any fragment. The art is to extract these pairs in the right way.
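
One possible way to build these pairs in plain Java is sketched below. It assumes that the group names are collected by scanning the regex source text for `(?<name>`. This is an illustration only and not necessarily how the helper used later in this post works. The class `NamedGroupExtractor` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupExtractor {
    // Finds "(?<name>" in the regex source. Lookbehinds such as "(?<=" or "(?<!"
    // are skipped because a group name must start with a letter.
    // Limitation of this sketch: an escaped "\(" followed by "?<name>" is not detected.
    private static final Pattern GROUP_NAME =
        Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    // Lists the group names in the order in which they appear in the regex.
    static List<String> groupNames(String regex) {
        List<String> names = new ArrayList<>();
        Matcher m = GROUP_NAME.matcher(regex);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    // Builds name -> matched content for the first match; empty if nothing matches.
    static Map<String, String> extract(String regex, String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find()) {
            for (String name : groupNames(regex)) {
                pairs.put(name, m.group(name)); // null if the group did not participate
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(extract("(?<key1>foo)(?<key2>[bar]+)", "foobar"));
    }
}
```

A `LinkedHashMap` keeps the pairs in the order of the groups, which later maps naturally to an ordered list of columns.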

Webserver Accesslog Example

171.204.119.130 - - [03/Aug/2020:20:23:00 +0200] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://bytefusion.de/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"

The above text is an extract from an access log file. The experienced user identifies a set of entities in it, such as the client IP address, the timestamp, the request, and the status code. In an earlier post, I described my first try to extract key-value pairs from files. Here we do the same, again with a regex but with more comfort.
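
Before moving to Spark, the idea can be tried on the sample line in plain Java. The pattern below is a simplified one written for this sketch, so it is an assumption and not the regex used in the Spark example. The class `AccessLogLine` and the `NAMES` array are likewise hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccessLogLine {
    // Simplified access-log pattern (assumption for this sketch only).
    static final Pattern LOG = Pattern.compile(
        "(?<ip>[0-9]+(\\.[0-9]+){3})\\s(?<identd>\\S+)\\s(?<user>\\S+)\\s"
        + "\\[(?<datetime>[^\\]]+)\\]\\s"
        + "\"(?<type>[A-Z]+) (?<uri>\\S+) HTTP/(?<httpversion>[0-9.]+)\"\\s"
        + "(?<httpstatus>[0-9]+)\\s(?<size>[0-9]+)\\s"
        + "\"(?<referrer>[^\"]*)\"\\s\"(?<agent>[^\"]*)\".*");

    // Group names listed by hand, in the order in which they appear in the pattern.
    static final String[] NAMES = {"ip", "identd", "user", "datetime", "type", "uri",
        "httpversion", "httpstatus", "size", "referrer", "agent"};

    // Returns name -> value for one log line; empty if the line does not match.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = LOG.matcher(line);
        if (m.matches()) {
            for (String name : NAMES) {
                fields.put(name, m.group(name));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "171.204.119.130 - - [03/Aug/2020:20:23:00 +0200] "
            + "\"POST /administrator/index.php HTTP/1.1\" 200 4494 "
            + "\"http://bytefusion.de/administrator/\" "
            + "\"Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0\" \"-\"";
        parse(line).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Every value comes back as a string, keyed by the name of its group.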

Java Example

// read accesslog
Dataset<Row> df = spark.read().text("samples/access.log");
// apply regex as udf
df = df.withColumn("details", SparkTools.regex(df.col("value"), "(?<ip>(([0-9]+)(\\.[0-9]+){3}))\\s(?<identd>[^\\s]+)\\s(?<user>[^\\s]+)\\s\\[(?<datetime>[^\\]]+)\\]\\s\"(?<request>((?<type>GET|POST|HEAD|DELETE|OPTIONS|TRACE|PUT) (?<uri>[^\\s]+) (HTTP/(?<httpversion>[^\\s]+))|[^\"]|(?<=\\\\)\")+)\"\\s(?<httpstatus>[0-9]+)\\s(?<size>[0-9]+)\\s\"(?<referrer>([^\"]|(?<=\\\\)\")+)\"\\s\"(?<agent>([^\"]|(?<=\\\\)\")+)\"\\s\"(?<xxxx>([^\"]|(?<=\\\\)\")+)\""));
// output schema
df.printSchema();
// output result
df.select("details.*").show();

First, we create a data frame from the log file (line 2). Second, a user-defined function applies the regex and adds the result to the data frame as a new column named details (line 4). Finally, we print the schema and the result.

Output from dataFrame.printSchema()

root
|-- value: string (nullable = true)
|-- filename: string (nullable = false)
|-- details: struct (nullable = true)
|    |-- ip: string (nullable = true)
|    |-- identd: string (nullable = true)
|    |-- user: string (nullable = true)
|    |-- datetime: string (nullable = true)
|    |-- request: string (nullable = true)
|    |-- type: string (nullable = true)
|    |-- uri: string (nullable = true)
|    |-- httpversion: string (nullable = true)
|    |-- httpstatus: string (nullable = true)
|    |-- size: string (nullable = true)
|    |-- referrer: string (nullable = true)
|    |-- agent: string (nullable = true)
|    |-- xxxx: string (nullable = true)

Note that all projected columns are of string type, just like the results of the regex capturing groups. The Spark API helps in transforming the date or numeric data into other Spark types.
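
To illustrate which typed values hide behind these strings, here is a minimal plain-Java sketch using the JDK rather than Spark. The class `TypedFields` and its method names are hypothetical. In Spark itself, `Column.cast` and `functions.to_timestamp` serve the same purpose.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class TypedFields {
    // Access-log timestamp layout, e.g. 03/Aug/2020:20:23:00 +0200.
    // Locale.ENGLISH is needed so that month names such as "Aug" parse everywhere.
    static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Converts the datetime string into a timestamp with its UTC offset.
    static OffsetDateTime parseDatetime(String value) {
        return OffsetDateTime.parse(value, LOG_TIME);
    }

    // Converts the httpstatus string into an int.
    static int parseStatus(String value) {
        return Integer.parseInt(value);
    }

    // Converts the size string into a long (response size in bytes).
    static long parseSize(String value) {
        return Long.parseLong(value);
    }

    public static void main(String[] args) {
        System.out.println(parseDatetime("03/Aug/2020:20:23:00 +0200"));
        System.out.println(parseStatus("200"));
        System.out.println(parseSize("4494"));
    }
}
```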

Output from select("...").show()

+---------------+------+----+--------------------+--------------------+----+--------------------+-----------+----------+----+--------------------+--------------------+----+
|             ip|identd|user|            datetime|             request|type|                 uri|httpversion|httpstatus|size|            referrer|               agent|xxxx|
+---------------+------+----+--------------------+--------------------+----+--------------------+-----------+----------+----+--------------------+--------------------+----+
|109.169.248.247|     -|   -|12/Dec/2015:18:25...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|109.169.248.247|     -|   -|12/Dec/2015:18:25...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|    46.72.177.4|     -|   -|12/Dec/2015:18:31...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|    46.72.177.4|     -|   -|12/Dec/2015:18:31...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
| 83.167.113.100|     -|   -|12/Dec/2015:18:31...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
| 83.167.113.100|     -|   -|12/Dec/2015:18:31...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|   95.29.198.15|     -|   -|12/Dec/2015:18:32...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|   95.29.198.15|     -|   -|12/Dec/2015:18:32...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  109.184.11.34|     -|   -|12/Dec/2015:18:32...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  109.184.11.34|     -|   -|12/Dec/2015:18:32...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|   91.227.29.79|     -|   -|12/Dec/2015:18:33...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|   91.227.29.79|     -|   -|12/Dec/2015:18:33...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  90.154.66.233|     -|   -|12/Dec/2015:18:36...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  90.154.66.233|     -|   -|12/Dec/2015:18:36...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  95.140.24.131|     -|   -|12/Dec/2015:18:38...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  95.140.24.131|     -|   -|12/Dec/2015:18:38...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  95.188.245.16|     -|   -|12/Dec/2015:18:38...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  95.188.245.16|     -|   -|12/Dec/2015:18:38...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  46.72.213.133|     -|   -|12/Dec/2015:18:39...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  46.72.213.133|     -|   -|12/Dec/2015:18:39...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
+---------------+------+----+--------------------+--------------------+----+--------------------+-----------+----------+----+--------------------+--------------------+----+

Conclusion

The combination of named capturing groups and Apache Spark creates a powerful log file analytics engine. Load text files, use regex to generate data frames, and focus on analytics.

Find more details here: