Apache Spark: autocomplete and package loading fail


Running Apache Spark from a Docker image causes problems with autocomplete in the shell: the spark-shell autocomplete is simply broken. Suspicion first falls on the terminal settings and the JLine configuration. In fact, the cause lies in a completely different place.

The story background

The Apache Spark shell is started from the bitnami/spark Docker image. The shell opens normally, but autocomplete does not work. Instead of completion suggestions, strange control characters appear in the terminal. The cursor cannot be moved within the line, and the command history is inaccessible as well.
Separately, an attempt to load an external package at spark-shell startup also produces strange errors. At first, the two problems appear to be independent of each other; further analysis, however, reveals a common cause.

Terminal Settings

Linux terminal behavior is governed by a variety of settings. Environment variables control the character set and encoding, and they also affect control characters such as the cursor keys and the Tab key. Many of these settings are made through environment variables in the shell, for example for remote access, where PuTTY provides sensible defaults. Changing these settings has no effect here: the cursor keys and the Tab key work fine in bash, but as soon as the Spark shell is opened from there (inside the container), the problems described at the beginning occur.
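To rule out the environment, the usual suspects can be inspected directly inside the container. A quick diagnostic sketch (the variable names are standard; the values in the comments are only examples):

    echo $TERM          # e.g. xterm-256color; an unknown entry breaks key handling
    echo $LANG $LC_ALL  # character encoding, e.g. en_US.UTF-8
    stty -a             # current control-character mapping of the terminal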

JLine settings

The next candidate for narrowing things down is the JLine library. Scala, and therefore also spark-shell, uses it for the interactive shell. Various terminal emulations can be passed as parameters on the command line. Unfortunately, all of these settings remain ineffective as well: completion still does not work.
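One way to experiment with these parameters is via Java system properties. A sketch, assuming Spark's Scala 2.12 REPL with its bundled JLine 2 (the property name may differ in other JLine versions):

    # Force a specific terminal emulation for the REPL
    spark-shell --driver-java-options "-Djline.terminal=unix"

    # Or, as a test, disable JLine's terminal handling entirely
    spark-shell --driver-java-options "-Djline.terminal=jline.UnsupportedTerminal"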

Loading external packages

Independently of the issue described above, another problem occurs. The spark-shell allows loading external artifacts by setting the --packages option on the command line. While the option does its job in a locally installed Spark, errors occur in the Docker container: the output points to missing access rights in the file system. Initially, no connection between the two problems is apparent.
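For illustration, such a call might look like this (the Maven coordinates are just an example artifact, not the one from the original setup):

    # Resolve an external artifact from Maven Central and add it to the shell
    spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1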

Analyzing the Problem

On closer examination, the connection between the two problems becomes clear. The files of the Spark installation are owned by root: both the installation folder itself and all files it contains belong to the root user. The container, however, runs as a different, unprivileged user. This user lacks the right to create folders in the working directory, which breaks two things (see the sketch after this list):
  • On startup, spark-shell tries to create a folder ./tmp in the current directory. Due to the missing rights, the creation fails; the shell starts anyway, but completion is broken.
  • The situation with the external packages is similar. Here, a folder ./ivy is created in the current directory. This also fails because of the missing file-system rights; loading the artifacts throws an exception, and the artifacts are missing in the shell.
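The ownership problem, and one possible mitigation, can be sketched as follows. The paths assume the bitnami image layout; spark.jars.ivy is a documented Spark setting for relocating the Ivy cache, while spark.repl.classdir is an internal one for the REPL's class output:

    # Inspect ownership of the Spark installation inside the container
    docker run --rm -it bitnami/spark ls -ld /opt/bitnami/spark

    # Mitigation sketch: redirect the affected directories
    # to a location the container user may write to
    docker run --rm -it bitnami/spark spark-shell \
      --conf spark.jars.ivy=/tmp/.ivy2 \
      --conf spark.repl.classdir=/tmp/spark-repl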

Conclusion

The error can be traced in the project’s issue tracker: it is recorded under #85 and described there with detailed terminal output. It was eventually confirmed as a bug, and a fix was created.
The analysis was not easy, because all clues pointed to the shell environment or the terminal settings. In the end, however, the cause was nothing more than missing access rights. Starting the container with root privileges serves as a temporary workaround (see below); with the fix in place, the Spark shell can be run in userspace again.
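Until the fix lands in the image, running the container as root is the pragmatic escape hatch. A workaround sketch, not a recommendation for production use:

    # Temporary workaround: start the container with root privileges
    docker run --rm -it --user root bitnami/spark spark-shell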