Apache Spark and Hadoop sequence files: common pitfalls

Jul 20, 2018 | Big Data

Hadoop sequence files work well with Apache Spark

Hadoop sequence files are key-value containers and offer efficient access for the Apache Spark analytics engine. Near-random access to sections of a sequence file allows Spark to split it into parts that can be processed in parallel. This makes sequence files a good choice for storing large data sets for later processing with Spark.

Spark offers native access to sequence files and makes their contents available for processing. Simply iterating over a sequence file with foreach works well, but the Hadoop API for accessing these files has some limitations, and those limitations also take effect when working with Spark. Hadoop uses an interface named Writable to access keys and values. Hadoop's sequence file reader reuses exactly one key instance and one value instance while reading the whole file. The following code shows how the sequence file reader works:

// assumes conf (a Hadoop Configuration) and seqFilePath (a Path) are defined
SequenceFile.Reader reader =
    new SequenceFile.Reader(conf, SequenceFile.Reader.file(seqFilePath));
Text key = new Text();                // single key instance, reused for every record
IntWritable val = new IntWritable();  // single value instance, reused for every record
while (reader.next(key, val)) {
    System.out.println(key + "\t" + val);
}
reader.close();

Pitfalls – the wrong way

As you can see, there is exactly one "key" instance and one "val" instance used to access the entire contents of the sequence file. In each iteration of the while loop these instances are refilled with the values at the current position. The code above is perfectly fine for printing values to the console. But you should not add these writables to a collection: if you do, your collection will contain the same instance over and over, and every entry will show the same (last) value.
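The Hadoop classes are not needed to demonstrate this effect. Here is a minimal plain-Java sketch; MutableInt is a hypothetical stand-in for the reused IntWritable instance that the sequence file reader refills on every next() call:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop's IntWritable: a single mutable holder
// that a reader would refill on each next() call.
class MutableInt {
    private int value;
    public void set(int v) { value = v; }
    public int get() { return value; }
    @Override public String toString() { return Integer.toString(value); }
}

public class ReusePitfall {
    public static void main(String[] args) {
        MutableInt val = new MutableInt();           // one instance, reused
        List<MutableInt> wrong = new ArrayList<>();
        List<Integer> right = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            val.set(i);            // the "reader" refills the same instance
            wrong.add(val);        // stores a reference, not a copy
            right.add(val.get());  // copies the primitive value out
        }
        System.out.println(wrong); // prints [3, 3, 3] - all entries are the same object
        System.out.println(right); // prints [1, 2, 3]
    }
}
```

The "wrong" list contains three references to the same holder, so all entries show the final value; the "right" list copied each value out before the holder was refilled.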

The same effect occurs whenever you keep references to the writable instances. A Spark map operation that wraps the Hadoop key-value tuple in a facade will leave all facade instances referencing the same two objects:

rdd.map(tuple -> new Facade(tuple._1, tuple._2)); // wrong

Doing it the right way

The example above creates many facade instances that all reference the same writable instances. The solution is simple: copy the current values out of the writables:

rdd.map(tuple -> new Facade(tuple._1.get(), tuple._2.get())); // right

Calling get() on a writable such as IntWritable returns the current primitive value, i.e. a copy that is independent of the reused writable instance (for a Text key you would use toString(), which returns a new immutable String). After the reader moves on to the next element, the copied values are unaffected. The resulting RDD contains a list of distinct elements.
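The article does not show the Facade class itself; a minimal sketch could look like the following. The class name comes from the examples above, but the field and accessor names are assumptions; the point is that the constructor stores copied, immutable data rather than the writables:

```java
import java.io.Serializable;

// Hypothetical facade holding values copied out of the writables.
// It stays valid after the reader refills the writable instances,
// because it never holds a reference to them.
public class Facade implements Serializable {
    private final String key; // e.g. Text#toString() - a new immutable String
    private final int value;  // e.g. IntWritable#get() - the primitive int

    public Facade(String key, int value) {
        this.key = key;
        this.value = value;
    }

    public String getKey() { return key; }
    public int getValue() { return value; }

    @Override public String toString() { return key + "=" + value; }
}
```

Because the facade holds only a String and an int, Spark can also serialize it cheaply when shuffling or collecting the RDD.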