Apache Spark and Hadoop Sequence Files: Common Pitfalls

Hadoop sequence files work well with Apache Spark

Hadoop sequence files are key-value containers that the Apache Spark analytics engine can read efficiently. Near-random access to sections of a sequence file lets Spark split the file into parts that can be processed in parallel. This makes sequence files a good choice for storing large data sets for later processing with Spark.

Spark offers native access to sequence files and makes their contents available for processing. Iterating over a sequence file with foreach is straightforward, but the Hadoop API for accessing these files has some limitations, and they also apply when working with Spark. Hadoop uses an interface named Writable to access keys and values. Hadoop's sequence file reader uses exactly one instance each for the key and the value to read the whole file. The following code shows how the sequence file reader works:

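import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Reader;
import org.apache.hadoop.io.Text;

// conf (a Hadoop Configuration) and seqFilePath (a Path) are assumed to be defined elsewhere
SequenceFile.Reader reader = new SequenceFile.Reader(conf, Reader.file(seqFilePath));
// one key instance and one value instance are reused for every record
Text key = new Text();
IntWritable val = new IntWritable();
// next() overwrites key and val with the record at the current position
while (reader.next(key, val)) {
    System.err.println(key + " " + val);
}
reader.close();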

Pitfalls – the wrong way

As you can see, exactly one instance each of “key” and “val” is used to access all contents of the sequence file. In each iteration of the while loop, these instances are refilled with the values at the current position. The code above is perfectly fine for printing values to the console. But you should not add these writables to a collection. If you do, the collection will contain the same instance over and over, and every entry will show the same (last) value.
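The effect can be reproduced without Hadoop on the classpath. The following is a minimal sketch in plain Java: MutableText is a hypothetical stand-in for a writable such as Text, and the for loop mimics reader.next(key, val) refilling the same instance. It is an illustration under these assumptions, not the Hadoop implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class ReusedInstanceDemo {

    // Hypothetical stand-in for a Hadoop writable: one mutable object, refilled per record.
    static final class MutableText {
        private String value = "";
        void set(String v) { value = v; }
        @Override public String toString() { return value; }
    }

    // Wrong: adds the same reused instance to the list once per record.
    static List<MutableText> collectReferences(String[] records) {
        MutableText key = new MutableText();
        List<MutableText> result = new ArrayList<>();
        for (String record : records) {
            key.set(record);   // mimics reader.next(key, val)
            result.add(key);   // stores a reference, not a snapshot
        }
        return result;
    }

    // Right: stores an immutable copy of the current content.
    static List<String> collectCopies(String[] records) {
        MutableText key = new MutableText();
        List<String> result = new ArrayList<>();
        for (String record : records) {
            key.set(record);
            result.add(key.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] records = {"a", "b", "c"};
        System.out.println(collectReferences(records)); // prints [c, c, c]
        System.out.println(collectCopies(records));     // prints [a, b, c]
    }
}
```

With real writables the same idea applies: copy the content out of the instance (for example with toString() on a Text or get() on an IntWritable) before storing it.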

The same effect occurs when working with references to writable instances. A Spark map operation that wraps each Hadoop key-value pair in a facade results in all facade instances referencing the same objects.
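The facade problem can be sketched in plain Java as well. MutableInt, ReferencingFacade and CopyingFacade are hypothetical names introduced only for this illustration. The loop plays the role of the map operation, and MutableInt stands in for a reused IntWritable.

```java
import java.util.ArrayList;
import java.util.List;

final class FacadeDemo {

    // Hypothetical stand-in for a reused IntWritable.
    static final class MutableInt {
        private int value;
        void set(int v) { value = v; }
        int get() { return value; }
    }

    // Wrong: keeps a reference to the reused instance.
    static final class ReferencingFacade {
        private final MutableInt val;
        ReferencingFacade(MutableInt val) { this.val = val; }
        int value() { return val.get(); }
    }

    // Right: copies the primitive value when the facade is created.
    static final class CopyingFacade {
        private final int val;
        CopyingFacade(MutableInt val) { this.val = val.get(); }
        int value() { return val; }
    }

    // Builds one facade per record, then reads all values after the loop has finished.
    static List<Integer> valuesViaReferences(int[] records) {
        MutableInt box = new MutableInt();
        List<ReferencingFacade> facades = new ArrayList<>();
        for (int r : records) {
            box.set(r);                               // instance is refilled
            facades.add(new ReferencingFacade(box));  // every facade shares it
        }
        List<Integer> out = new ArrayList<>();
        for (ReferencingFacade f : facades) out.add(f.value());
        return out;
    }

    static List<Integer> valuesViaCopies(int[] records) {
        MutableInt box = new MutableInt();
        List<CopyingFacade> facades = new ArrayList<>();
        for (int r : records) {
            box.set(r);
            facades.add(new CopyingFacade(box));      // snapshot taken here
        }
        List<Integer> out = new ArrayList<>();
        for (CopyingFacade f : facades) out.add(f.value());
        return out;
    }

    public static void main(String[] args) {
        int[] records = {1, 2, 3};
        System.out.println(valuesViaReferences(records)); // prints [3, 3, 3]
        System.out.println(valuesViaCopies(records));     // prints [1, 2, 3]
    }
}
```

The design choice is to take the copy at the moment the facade is created, while the reused instance still holds the current record. In a Spark job the equivalent step would be to convert the writables to plain values inside the map function, before the results are cached or collected.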

