Apache Spark and Hadoop Sequence Files: Common Pitfalls

Jul 20, 2018 | Big Data

Hadoop sequence files are a good fit for Apache Spark

Hadoop sequence files are key-value containers that the Apache Spark analytics engine can access efficiently. Near-random access to sections of a sequence file allows Spark to split it into parts that can be processed in parallel. This makes sequence files a good choice for storing large sets of data for later processing with Spark.

Spark offers native access to sequence files and makes their contents available for processing. Simply iterating over a sequence file with foreach works as expected, but the Hadoop API for reading these files comes with some limitations, and they also take effect when working with Spark. Hadoop uses an interface named Writable to access keys and values. Hadoop's sequence file reader reuses exactly one key instance and one value instance while reading the whole file. The following code shows how the sequence file reader works:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
Path seqFilePath = new Path("counts.seq"); // example path

SequenceFile.Reader reader =
    new SequenceFile.Reader(conf, SequenceFile.Reader.file(seqFilePath));

// exactly one key and one value instance for the whole file
Text key = new Text();
IntWritable val = new IntWritable();

// next() overwrites key and val in place for each record
while (reader.next(key, val)) {
    System.err.println(key + "\t" + val);
}

reader.close();
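
In Spark, the same file can be loaded natively as an RDD of key-value pairs. A minimal sketch (the HDFS path and the Text/IntWritable record types are assumptions for illustration); the rdd created here is the one used in the map examples below:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("SeqFileDemo");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

// each record is delivered as a (Text, IntWritable) pair
JavaPairRDD<Text, IntWritable> rdd =
    sc.sequenceFile("hdfs:///data/counts.seq", Text.class, IntWritable.class);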

Pitfalls – the wrong way

As you can see, there is exactly one “key” instance and one “val” instance used to access the entire contents of the sequence file. In the while loop these instances are refilled with the values at the current position. The code above is perfectly fine for printing values to the console, but you should not add these writables to a collection. If you do, your collection ends up holding the same instance many times, and every entry carries the same (last) value. The sketch below illustrates this.
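
A minimal sketch of the broken pattern, reusing the reader, key, and val set up as in the example above:

import java.util.ArrayList;
import java.util.List;

List<Text> keys = new ArrayList<>();

while (reader.next(key, val)) {
    keys.add(key); // adds the SAME Text instance on every iteration
}
reader.close();

// prints the last key read, once per record
keys.forEach(System.err::println);

// copying the current contents fixes it:
// keys.add(new Text(key));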

The same effect occurs whenever you hold on to references to the writable instances. A Spark map operation that wraps the Hadoop key-value tuple in a facade object will leave all facade instances referencing the same two writables:

rdd.map(tuple -> new Facade(tuple._1, tuple._2)); // wrong
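
Facade is not defined in the original post; a hypothetical sketch with the two constructors used in these snippets could look like this:

// hypothetical facade wrapping one key-value record
public class Facade implements java.io.Serializable {
    private final Object key; // Text reference (wrong) or String copy (right)
    private final Object val; // IntWritable reference (wrong) or boxed int (right)

    // used by the "wrong" snippet: stores the reused writables themselves
    public Facade(Text key, IntWritable val) {
        this.key = key;
        this.val = val;
    }

    // used by the "right" snippet: stores extracted, immutable values
    public Facade(String key, int val) {
        this.key = key;
        this.val = val;
    }
}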

The right way

The example above creates many facade instances that all reference the same writable instances. The solution is simple: extract the current values from the writables before storing them:

rdd.map(tuple -> new Facade(tuple._1.toString(), tuple._2.get())); // right

Text.toString() returns a new immutable String and IntWritable.get() returns a primitive int, so the extracted values are independent of the writables that produced them; when the reader moves on to the next element, the values already stored are not affected. The resulting RDD contains distinct elements.
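
If the facade should keep actual writable instances rather than extracted values, copying the writables per record works as well (a sketch, assuming the Text/IntWritable types from above):

// create fresh writable copies for every record
rdd.map(tuple -> new Facade(
        new Text(tuple._1),                // Text copy constructor
        new IntWritable(tuple._2.get()))); // fresh IntWritable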
