Make it easy: Apache Spark, Data Frames and Regex Power

by | Aug 5, 2020 | Big Data

Regular expressions are a powerful tool for splitting text into fragments, and Apache Spark is an analytics engine capable of processing large data sets. Named capturing groups make regular expressions more accessible. Unfortunately, Java and Scala offer only limited support here, so named capturing groups are not available in a direct way. Fortunately, there is an easy-to-use approach.

Regular Expressions and Named Capturing Groups

A regular expression matches a text or parts of it. Capturing groups split the regular expression into one or more fragments, and each matched piece is accessible through its group. The full regular expression is group number 0, and every opening parenthesis defines a further group. Groups are therefore numbered in the order of their opening parentheses. Because editing the regex may change the numbering, this scheme is error-prone for humans.

Regular Expression

(foo)([bar]+)
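To make the numbering concrete, here is a minimal sketch in plain Java using java.util.regex. The class name NumberedGroupsDemo and its helper method are made up for illustration and are not part of the original example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberedGroupsDemo {
    // Returns the text captured by the group with the given index,
    // or null if the whole input does not match the regex.
    static String group(String regex, String input, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String regex = "(foo)([bar]+)";
        System.out.println(group(regex, "foobar", 0)); // group 0: the whole match
        System.out.println(group(regex, "foobar", 1)); // group 1: first opening parenthesis
        System.out.println(group(regex, "foobar", 2)); // group 2: second opening parenthesis
    }
}
```

Adding another pair of parentheses in front of (foo) would shift both indexes by one, which is exactly the maintenance problem described above.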

Named capturing groups are more human-readable.

Named groups use a small syntax extension, (?<name>...), that associates the group with a label. Editing the regex therefore does not affect a group's name. The label is sticky! A regex may also mix named and unnamed groups in any order.

Regular Expression with Named Capturing Groups

(?<key1>foo)(?<key2>[bar]+)
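The following sketch shows the sticky label in plain Java. It assumes nothing beyond java.util.regex; the class NamedGroupsDemo and the extra optional group (x)? are invented for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupsDemo {
    // Returns the text captured by the named group,
    // or null if the whole input does not match the regex.
    static String group(String regex, String input, String name) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(name) : null;
    }

    public static void main(String[] args) {
        String original = "(?<key1>foo)(?<key2>[bar]+)";
        // Hypothetical edit: an unnamed optional group is inserted in front.
        String edited = "(x)?(?<key1>foo)(?<key2>[bar]+)";

        // The names resolve to the same fragments before and after the edit.
        System.out.println(group(original, "foobar", "key1"));
        System.out.println(group(edited, "foobar", "key1"));
        System.out.println(group(edited, "foobar", "key2"));
    }
}
```

With numeric access, the same edit would have moved foo from group 1 to group 2.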

How do named capturing groups help?

In the case of a match, each group name and its matching content form a key-value pair. In other words, when the regex matches, we can build key-value pairs from every named fragment. The art is to extract these pairs in the right way.
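One possible way to collect such pairs in plain Java is sketched below. Older Java versions do not expose the group names of a compiled pattern, so this sketch scans the regex source for (?<name> as a workaround. That scan is an assumption of this example, not necessarily the approach used later in this post, and it can be fooled by an escaped parenthesis or a character class that happens to contain the same characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueDemo {
    // Finds "(?<name>" in a regex source. Lookbehinds such as (?<= or (?<!
    // are skipped because a group name must start with a letter.
    private static final Pattern GROUP_NAME =
            Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    // Returns name -> captured text for every named group,
    // or an empty map if the input does not match.
    static Map<String, String> extract(String regex, String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return pairs;
        }
        Matcher names = GROUP_NAME.matcher(regex);
        while (names.find()) {
            String name = names.group(1);
            pairs.put(name, m.group(name));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints {key1=foo, key2=bar}
        System.out.println(extract("(?<key1>foo)(?<key2>[bar]+)", "foobar"));
    }
}
```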

Web Server Access Log Example

171.204.119.130 - - [03/Aug/2020:20:23:00 +0200] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://bytefusion.de/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"

The text above is an extract from an access log file. An experienced user quickly identifies a set of entities in it, such as the client IP address, the timestamp, the request, and the status code. In an earlier post, I described my first attempt to extract key-value pairs from files. Here we do the same, not only with a regex but also with more comfort.

Java Example

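import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// read the access log; every line becomes one row in the "value" column
Dataset<Row> df = spark.read().text("samples/access.log");
// apply the regex as a UDF (SparkTools.regex is this post's helper, not part of Spark);
// backslashes are doubled and quotes escaped because the regex is a Java string literal
df = df.withColumn("details", SparkTools.regex(df.col("value"),
    "(?<ip>(([0-9]+)(\\.[0-9]+){3}))\\s(?<identd>[^\\s]+)\\s(?<user>[^\\s]+)\\s\\[(?<datetime>[^\\]]+)\\]\\s"
    + "\"(?<request>((?<type>GET|POST|HEAD|DELETE|OPTIONS|TRACE|PUT) (?<uri>[^\\s]+) (HTTP/(?<httpversion>[^\\s]+))|[^\"]|(?<=\\\\)\")+)\"\\s"
    + "(?<httpstatus>[0-9]+)\\s(?<size>[0-9]+)\\s"
    + "\"(?<referrer>([^\"]|(?<=\\\\)\")+)\"\\s\"(?<agent>([^\"]|(?<=\\\\)\")+)\"\\s\"(?<xxxx>([^\"]|(?<=\\\\)\")+)\""));
// output schema
df.printSchema();
// output result
df.select("details.*").show();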

First, we create a data frame from the log file with spark.read().text(...). Second, a user-defined function applies the regex and projects the result as a new column, details, onto the data frame. Finally, we print the schema and the result.
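The implementation of SparkTools.regex is not shown in this post. As a rough idea of what such a function might do for a single row, here is a plain Java sketch without any Spark dependency. The class AccessLogLineDemo, the shortened pattern, and the fixed list of field names are assumptions made for this illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccessLogLineDemo {
    static final String SAMPLE =
            "171.204.119.130 - - [03/Aug/2020:20:23:00 +0200] "
            + "\"POST /administrator/index.php HTTP/1.1\" 200 4494 "
            + "\"http://bytefusion.de/administrator/\" "
            + "\"Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0\" \"-\"";

    // Shortened pattern: referrer, agent and the last field are skipped via ".*".
    private static final Pattern LINE = Pattern.compile(
            "(?<ip>[0-9]+(\\.[0-9]+){3})\\s(?<identd>\\S+)\\s(?<user>\\S+)\\s"
            + "\\[(?<datetime>[^\\]]+)\\]\\s"
            + "\"(?<type>[A-Z]+) (?<uri>\\S+) HTTP/(?<httpversion>\\S+)\"\\s"
            + "(?<httpstatus>[0-9]+)\\s(?<size>[0-9]+)\\s.*");

    private static final String[] FIELDS = {
        "ip", "identd", "user", "datetime", "type", "uri",
        "httpversion", "httpstatus", "size"
    };

    // Returns field name -> value for one log line, or an empty map if it does not match.
    static Map<String, String> parse(String line) {
        Map<String, String> details = new LinkedHashMap<>();
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            for (String field : FIELDS) {
                details.put(field, m.group(field));
            }
        }
        return details;
    }

    public static void main(String[] args) {
        parse(SAMPLE).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

In Spark, the resulting map would correspond to the fields of the details struct column, with one string field per group name.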

Output from dataFrame.printSchema()

root
|-- value: string (nullable = true)
|-- filename: string (nullable = false)
|-- details: struct (nullable = true)
|    |-- ip: string (nullable = true)
|    |-- identd: string (nullable = true)
|    |-- user: string (nullable = true)
|    |-- datetime: string (nullable = true)
|    |-- request: string (nullable = true)
|    |-- type: string (nullable = true)
|    |-- uri: string (nullable = true)
|    |-- httpversion: string (nullable = true)
|    |-- httpstatus: string (nullable = true)
|    |-- size: string (nullable = true)
|    |-- referrer: string (nullable = true)
|    |-- agent: string (nullable = true)
|    |-- xxxx: string (nullable = true)

Note that all projected columns are of string type, just like the results of the regex capturing groups. The Spark API helps to transform date or numeric data into other Spark types.
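In Spark itself, such conversions are typically done with built-in column functions such as cast and to_timestamp. To show what the conversion involves for the values in this log format, here is a plain Java sketch using java.time. The class TypeConversionDemo and the date pattern are assumptions of this example; the pattern is written for java.time and has not been checked against Spark's own pattern syntax.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class TypeConversionDemo {
    // Access log timestamps look like 03/Aug/2020:20:23:00 +0200.
    // Locale.ENGLISH is needed so that "Aug" parses on any machine.
    private static final DateTimeFormatter ACCESS_LOG_TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Converts the captured datetime string into a typed timestamp with offset.
    static OffsetDateTime toTimestamp(String datetime) {
        return OffsetDateTime.parse(datetime, ACCESS_LOG_TIME);
    }

    // Converts a captured numeric string, such as httpstatus or size, into an int.
    static int toInt(String number) {
        return Integer.parseInt(number);
    }

    public static void main(String[] args) {
        OffsetDateTime ts = toTimestamp("03/Aug/2020:20:23:00 +0200");
        System.out.println(ts.getYear() + "-" + ts.getMonthValue() + "-" + ts.getDayOfMonth());
        System.out.println(toInt("200"));
        System.out.println(toInt("4494"));
    }
}
```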

Output from select("...").show()

+---------------+------+----+--------------------+--------------------+----+--------------------+-----------+----------+----+--------------------+--------------------+----+
|             ip|identd|user|            datetime|             request|type|                 uri|httpversion|httpstatus|size|            referrer|               agent|xxxx|
+---------------+------+----+--------------------+--------------------+----+--------------------+-----------+----------+----+--------------------+--------------------+----+
|109.169.248.247|     -|   -|12/Dec/2015:18:25...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|109.169.248.247|     -|   -|12/Dec/2015:18:25...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|    46.72.177.4|     -|   -|12/Dec/2015:18:31...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|    46.72.177.4|     -|   -|12/Dec/2015:18:31...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
| 83.167.113.100|     -|   -|12/Dec/2015:18:31...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
| 83.167.113.100|     -|   -|12/Dec/2015:18:31...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|   95.29.198.15|     -|   -|12/Dec/2015:18:32...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|   95.29.198.15|     -|   -|12/Dec/2015:18:32...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  109.184.11.34|     -|   -|12/Dec/2015:18:32...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  109.184.11.34|     -|   -|12/Dec/2015:18:32...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|   91.227.29.79|     -|   -|12/Dec/2015:18:33...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|   91.227.29.79|     -|   -|12/Dec/2015:18:33...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  90.154.66.233|     -|   -|12/Dec/2015:18:36...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  90.154.66.233|     -|   -|12/Dec/2015:18:36...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  95.140.24.131|     -|   -|12/Dec/2015:18:38...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  95.140.24.131|     -|   -|12/Dec/2015:18:38...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  95.188.245.16|     -|   -|12/Dec/2015:18:38...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  95.188.245.16|     -|   -|12/Dec/2015:18:38...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
|  46.72.213.133|     -|   -|12/Dec/2015:18:39...|GET /administrato...| GET|     /administrator/|        1.1|       200|4263|                   -|Mozilla/5.0 (Wind...|   -|
|  46.72.213.133|     -|   -|12/Dec/2015:18:39...|POST /administrat...|POST|/administrator/in...|        1.1|       200|4494|http://bytefusion...|Mozilla/5.0 (Wind...|   -|
+---------------+------+----+--------------------+--------------------+----+--------------------+-----------+----------+----+--------------------+--------------------+----+

Conclusion

The combination of named capturing groups and Apache Spark creates a powerful log file analytics engine. Load text files, use regex to generate data frames, and focus on analytics.

