Regular Expressions and Named Capturing Groups
A regular expression matches a text or parts of it. Capturing groups split the regular expression into one or more fragments, and the text matched by each fragment is accessible through its group. The full regular expression defines group number 0, and every opening parenthesis of a capturing group defines a further group. Consequently, groups are numbered in the order of their opening parentheses. Because editing the regex may shift the numbering, numbered groups are not intuitive for humans.
Regular Expression
(foo)([bar]+)
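To make the numbering concrete, here is a minimal Java sketch based on the standard java.util.regex package. The class name NumberedGroupsDemo, the helper groups(), and the second pattern with the extra parentheses are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberedGroupsDemo {
    // Hypothetical helper: returns group 0 (the whole match) followed by
    // every numbered group, or an empty array if the regex does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "foobar,foo,bar": group 0, group 1, group 2
        System.out.println(String.join(",", groups("(foo)([bar]+)", "foobar")));
        // An extra pair of parentheses shifts "bar" from group 2 to group 3.
        // prints "foobar,foo,oo,bar"
        System.out.println(String.join(",", groups("(f(o+))([bar]+)", "foobar")));
    }
}
```

The second call shows the problem: any code that still reads group 2 now gets "oo" instead of "bar".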
Named capturing groups are more human-readable.
These named groups use a small syntax extension, (?<name>...), that associates the group with a name. For this reason, editing the regex does not affect a group’s name. The label is sticky! A regex may also mix named and unnamed groups in any order.
Regular Expression with Named Capturing Groups
(?<key1>foo)(?<key2>[bar]+)
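The following Java sketch illustrates why the label is sticky. It reads the group key2 by name, once from the regex above and once from an edited variant. The class NamedGroupsDemo, the helper capture(), and the edited pattern are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupsDemo {
    // Hypothetical helper: returns the text captured by the named group,
    // or null if the regex does not match the input.
    static String capture(String regex, String input, String groupName) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(groupName) : null;
    }

    public static void main(String[] args) {
        String original = "(?<key1>foo)(?<key2>[bar]+)";
        // Made-up edit: an unnamed group is inserted in front of key2,
        // so key2 moves from position 2 to position 3.
        String edited = "(?<key1>foo)(-?)(?<key2>[bar]+)";

        System.out.println(capture(original, "foobar", "key2")); // prints "bar"
        System.out.println(capture(edited, "foobar", "key2"));   // prints "bar"
    }
}
```

The lookup by name returns the same fragment in both cases, although the number of the group has changed.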
How do named capturing groups help?
In the case of a match, the group name and the matching content form a key-value pair. In other words, when the regex matches, we can build key-value pairs from any fragment. The art is to extract these pairs in the right way.
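One possible way to extract the pairs in plain Java is sketched below. It collects the group names by scanning the regex source for "(?<name>" and then asks the matcher for each name. The class KeyValueDemo, the helper extract(), the simplified access-log regex, and the shortened log line are assumptions for illustration. The name scan is a simplification and would be fooled by an escaped literal such as "\(?<x>" inside the regex.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueDemo {
    // Finds "(?<name>" in the regex source. Lookbehinds such as "(?<=" and
    // "(?<!" are skipped because a group name must start with a letter.
    private static final Pattern GROUP_NAME =
            Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    // Simplified, hypothetical regex for the start of an access-log line.
    static final String ACCESS_LOG_REGEX =
            "(?<ip>[0-9]+(\\.[0-9]+){3}) \\S+ \\S+ \\[(?<datetime>[^\\]]+)\\] "
            + "\"(?<type>[A-Z]+) (?<uri>\\S+) HTTP/(?<httpversion>[0-9.]+)\" "
            + "(?<httpstatus>[0-9]+) (?<size>[0-9]+)";

    // Returns name -> matched text in the order the groups appear in the
    // regex; the map is empty if the regex does not match.
    static Map<String, String> extract(String regex, String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return pairs;
        }
        Matcher names = GROUP_NAME.matcher(regex);
        while (names.find()) {
            String name = names.group(1);
            pairs.put(name, m.group(name));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "171.204.119.130 - - [03/Aug/2020:20:23:00 +0200] "
                + "\"POST /administrator/index.php HTTP/1.1\" 200 4494";
        // prints one "key = value" line per named group, starting with
        // "ip = 171.204.119.130"
        extract(ACCESS_LOG_REGEX, line)
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

A LinkedHashMap keeps the pairs in the order of the groups in the regex, which is convenient when the pairs later become the columns of a table.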
Webserver Accesslog Example
171.204.119.130 - - [03/Aug/2020:20:23:00 +0200] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://bytefusion.de/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
Java Example
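import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// read the access log; "spark" is an existing SparkSession
Dataset<Row> df = spark.read().text("samples/access.log");
// apply the regex as a UDF (SparkTools is part of the spark-tools repository); the named groups become the fields of "details"
df = df.withColumn("details", SparkTools.regex(df.col("value"),
    "(?<ip>(([0-9]+)(\\.[0-9]+){3}))\\s(?<identd>[^\\s]+)\\s(?<user>[^\\s]+)\\s\\[(?<datetime>[^\\]]+)\\]\\s"
    + "\"(?<request>((?<type>GET|POST|HEAD|DELETE|OPTIONS|TRACE|PUT) (?<uri>[^\\s]+) (HTTP/(?<httpversion>[^\\s]+))|[^\"]|(?<=\\\\)\")+)\"\\s"
    + "(?<httpstatus>[0-9]+)\\s(?<size>[0-9]+)\\s"
    + "\"(?<referrer>([^\"]|(?<=\\\\)\")+)\"\\s\"(?<agent>([^\"]|(?<=\\\\)\")+)\"\\s\"(?<xxxx>([^\"]|(?<=\\\\)\")+)\""));
// output schema
df.printSchema();
// output result
df.select("details.*").show();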
First, we create a data frame from the log file with spark.read().text(...). Second, a user-defined function applies the regex to the value column and adds the result as a new column named details. Finally, we print the schema and the result.
Output from dataFrame.printSchema()
root
|-- value: string (nullable = true)
|-- filename: string (nullable = false)
|-- details: struct (nullable = true)
|    |-- ip: string (nullable = true)
|    |-- identd: string (nullable = true)
|    |-- user: string (nullable = true)
|    |-- datetime: string (nullable = true)
|    |-- request: string (nullable = true)
|    |-- type: string (nullable = true)
|    |-- uri: string (nullable = true)
|    |-- httpversion: string (nullable = true)
|    |-- httpstatus: string (nullable = true)
|    |-- size: string (nullable = true)
|    |-- referrer: string (nullable = true)
|    |-- agent: string (nullable = true)
|    |-- xxxx: string (nullable = true)
Output from select("...").show()
+---------------+------+----+--------------------+--------------------+----+--------------------+-----------+----------+----+--------------------+--------------------+----+
|             ip|identd|user|            datetime|             request|type|                 uri|httpversion|httpstatus|size|            referrer|               agent|xxxx|
+---------------+------+----+--------------------+--------------------+----+--------------------+-----------+----------+----+--------------------+--------------------+----+
|109.169.248.247|     -|   -|12/Dec/2015:18:25...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|109.169.248.247|     -|   -|12/Dec/2015:18:25...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|    46.72.177.4|     -|   -|12/Dec/2015:18:31...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|    46.72.177.4|     -|   -|12/Dec/2015:18:31...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
| 83.167.113.100|     -|   -|12/Dec/2015:18:31...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
| 83.167.113.100|     -|   -|12/Dec/2015:18:31...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|   95.29.198.15|     -|   -|12/Dec/2015:18:32...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|   95.29.198.15|     -|   -|12/Dec/2015:18:32...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  109.184.11.34|     -|   -|12/Dec/2015:18:32...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  109.184.11.34|     -|   -|12/Dec/2015:18:32...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|   91.227.29.79|     -|   -|12/Dec/2015:18:33...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|   91.227.29.79|     -|   -|12/Dec/2015:18:33...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  90.154.66.233|     -|   -|12/Dec/2015:18:36...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  90.154.66.233|     -|   -|12/Dec/2015:18:36...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  95.140.24.131|     -|   -|12/Dec/2015:18:38...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  95.140.24.131|     -|   -|12/Dec/2015:18:38...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  95.188.245.16|     -|   -|12/Dec/2015:18:38...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  95.188.245.16|     -|   -|12/Dec/2015:18:38...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  46.72.213.133|     -|   -|12/Dec/2015:18:39...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  46.72.213.133|     -|   -|12/Dec/2015:18:39...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
+---------------+------+----+--------------------+--------------------+----+--------------------+-----------+----------+----+--------------------+--------------------+----+
Conclusion
The combination of named capturing groups and Apache Spark creates a powerful log file analytics engine. Load text files, use a regex to generate data frames, and focus on analytics.
Find more details here:
- The interactive online editor for composing expressions. A really great tool!
- My first experiments in 2017.
- https://github.com/bfblog/spark-tools