pyspark: Py4JJavaError: An error occurred while calling o138.loadClass.: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
I’ve been building a Docker Container that has support for Jupyter, Spark, GraphFrames, and Neo4j, and ran into a problem that had me pulling my (metaphorical) hair out!
The pyspark-notebook container gets us most of the way there, but it doesn’t have GraphFrames or Neo4j support. Adding Neo4j is as simple as pulling in the Python Driver from Conda Forge, which leaves us with GraphFrames.
When I use GraphFrames with pyspark locally, I pull it in via the --packages config parameter, like this:
./bin/pyspark --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
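The value passed to --packages is a Maven-style group:artifact:version coordinate, and the version part here also appears to encode the Spark and Scala versions the package was built for. A minimal sketch for pulling those pieces apart, so they can be compared against the Spark version in the container. The helper name and the version pattern are assumptions inferred from the single coordinate used in this post, not an official GraphFrames naming scheme:

```python
import re
from typing import NamedTuple


class GraphFramesCoordinate(NamedTuple):
    group: str
    artifact: str
    graphframes_version: str
    spark_version: str
    scala_version: str


# Assumed layout: <graphframes>-spark<spark>-s_<scala>, e.g. 0.7.0-spark2.4-s_2.11
_VERSION_RE = re.compile(
    r"^(?P<gf>[\d.]+)-spark(?P<spark>[\d.]+)-s_(?P<scala>[\d.]+)$"
)


def parse_coordinate(coordinate: str) -> GraphFramesCoordinate:
    """Split a group:artifact:version coordinate into its parts (hypothetical helper)."""
    parts = coordinate.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected group:artifact:version, got {coordinate!r}")
    group, artifact, version = parts
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised version layout: {version!r}")
    return GraphFramesCoordinate(
        group, artifact, match["gf"], match["spark"], match["scala"]
    )


coord = parse_coordinate("graphframes:graphframes:0.7.0-spark2.4-s_2.11")
print(coord.spark_version, coord.scala_version)
```

If the Spark or Scala version reported here differs from what the container ships, that mismatch is worth ruling out before debugging anything else.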
I thought the same approach would work in the Docker container, so I created a Dockerfile that extends jupyter/pyspark-notebook, and added this code into the SPARK_OPTS environment variable:
ARG BASE_CONTAINER=jupyter/pyspark-notebook
FROM $BASE_CONTAINER
LABEL maintainer="Mark Needham"
USER root
ENV SPARK_OPTS --driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
RUN conda install --quiet --yes 'conda-forge::neo4j-python-driver' && \
pip install graphframes && \
fix-permissions $CONDA_DIR && \
fix-permissions /home/$NB_USER
I built the Docker image:
docker build .
Successfully built fbcc49e923a6
And then ran it locally:
docker run -p 8888:8888 fbcc49e923a6
[I 08:12:44.168 NotebookApp] The Jupyter Notebook is running at:
[I 08:12:44.168 NotebookApp] http://(1f7d61b2f1de or
[I 08:12:44.168 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 08:12:44.171 NotebookApp]
To access the notebook, open this file in a browser:
Or copy and paste one of these URLs:
http://(1f7d61b2f1de or
I navigated to http://localhost:8888/?token=2f1c9e01326676af1a768b5e573eb9c58049c385a7714e53, which is where the Jupyter notebook is hosted. I uploaded a couple of CSV files, created a Jupyter notebook, and ran the following code:
from pyspark.sql.types import *
from graphframes import *
import pandas as pd

from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate(SparkConf().setMaster("local"))  # getOrCreate() takes a SparkConf, not a master string
spark = SparkSession(sc)

def create_transport_graph():
    node_fields = [
        StructField("id", StringType(), True),
        StructField("latitude", FloatType(), True),
        StructField("longitude", FloatType(), True),
        StructField("population", IntegerType(), True)
    ]
    nodes = spark.read.csv("data/transport-nodes.csv", header=True,
                           schema=StructType(node_fields))

    rels = spark.read.csv("data/transport-relationships.csv", header=True)
    # Swap src and dst so that every relationship also exists in the opposite direction
    reversed_rels = (rels.withColumn("newSrc", rels.dst)
                     .withColumn("newDst", rels.src)
                     .drop("dst", "src")
                     .withColumnRenamed("newSrc", "src")
                     .withColumnRenamed("newDst", "dst")
                     .select("src", "dst", "relationship", "cost"))

    # Combine the original and reversed relationships
    relationships = rels.union(reversed_rels)

    return GraphFrame(nodes, relationships)

g = create_transport_graph()
Unfortunately, it throws the following exception when it tries to read the data/transport-nodes.csv file on line 18:
Py4JJavaError: An error occurred while calling o138.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
at java.lang.ClassLoader.loadClass(
at java.lang.ClassLoader.loadClass(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at py4j.reflection.MethodInvoker.invoke(
at py4j.reflection.ReflectionEngine.invoke(
at py4j.Gateway.invoke(
at py4j.commands.AbstractCommand.invokeMethod(
at py4j.commands.CallCommand.execute(
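The loadClass call in the message suggests that the Python side asks the JVM to load org.graphframes.GraphFramePythonAPI by name, so the error means the GraphFrames JAR is not on the driver's classpath, even though the Python package imported fine. A minimal sketch of a check that can be run in the notebook before building a graph. The function name is hypothetical, and it relies on the private sc._jvm attribute that PySpark exposes for its py4j gateway, which may change between versions:

```python
def graphframes_class_available(sc, class_name="org.graphframes.GraphFramePythonAPI"):
    """Return True if the JVM behind the SparkContext `sc` can load `class_name`.

    Hypothetical helper: it goes through the context class loader of the
    current JVM thread and treats any failure as "not available".
    """
    try:
        loader = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader()
        loader.loadClass(class_name)
        return True
    except Exception:
        # With a real SparkContext, a missing class surfaces as a py4j error here.
        return False


# In the notebook, with the `sc` created earlier:
# print(graphframes_class_available(sc))
```

A result of False points at the JVM classpath rather than at the CSV files or the Python code.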
I Googled the error message, and came across this issue, which has a lot of suggestions for how to fix it. I tried them all!
I passed --packages to PYSPARK_SUBMIT_ARGS as well as SPARK_OPTS:
ARG BASE_CONTAINER=jupyter/pyspark-notebook
FROM $BASE_CONTAINER
LABEL maintainer="Mark Needham"
USER root
ENV SPARK_OPTS --driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
ENV PYSPARK_SUBMIT_ARGS --master local[*] pyspark-shell --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
RUN conda install --quiet --yes 'conda-forge::neo4j-python-driver' && \
pip install graphframes && \
fix-permissions $CONDA_DIR && \
fix-permissions /home/$NB_USER
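One detail worth checking with PYSPARK_SUBMIT_ARGS is the position of pyspark-shell. As far as I know, spark-submit treats everything after the primary resource as arguments to the application, so options placed after pyspark-shell may be silently ignored. I have not verified that this is what went wrong here. A minimal sketch that builds the value with pyspark-shell last and sets it from Python; the helper name is hypothetical, and the variable has to be set before the first SparkContext is created:

```python
import os


def build_submit_args(master="local[*]", packages=(), jars=()):
    """Assemble a PYSPARK_SUBMIT_ARGS value with pyspark-shell as the final token."""
    args = ["--master", master]
    if packages:
        # --packages takes a comma-separated list of Maven coordinates
        args += ["--packages", ",".join(packages)]
    if jars:
        # --jars takes a comma-separated list of JAR paths
        args += ["--jars", ",".join(jars)]
    args.append("pyspark-shell")
    return " ".join(args)


submit_args = build_submit_args(
    packages=["graphframes:graphframes:0.7.0-spark2.4-s_2.11"]
)
# Only affects Spark contexts created after this point in the same process.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```

The same ordering applies if the value is set with ENV in the Dockerfile instead of from Python.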
I downloaded the GraphFrames JAR, and referenced it directly using the --jars argument:
ARG BASE_CONTAINER=jupyter/pyspark-notebook
FROM $BASE_CONTAINER
LABEL maintainer="Mark Needham"
USER root
ENV SPARK_OPTS --driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info --jars /home/jovyan/graphframes-0.7.0-spark2.4-s_2.11.jar
ENV PYSPARK_SUBMIT_ARGS --master local[*] pyspark-shell --jars /home/jovyan/graphframes-0.7.0-spark2.4-s_2.11.jar
RUN conda install --quiet --yes 'conda-forge::neo4j-python-driver' && \
pip install graphframes && \
fix-permissions $CONDA_DIR && \
fix-permissions /home/$NB_USER
COPY graphframes-0.7.0-spark2.4-s_2.11.jar /home/$NB_USER/graphframes-0.7.0-spark2.4-s_2.11.jar
I used the --py-files argument as well:
ARG BASE_CONTAINER=jupyter/pyspark-notebook
FROM $BASE_CONTAINER
LABEL maintainer="Mark Needham"
USER root
ENV SPARK_OPTS --driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info --jars /home/jovyan/graphframes-0.7.0-spark2.4-s_2.11.jar --py-files /home/jovyan/graphframes-0.7.0-spark2.4-s_2.11.jar
ENV PYSPARK_SUBMIT_ARGS --master local[*] pyspark-shell --jars /home/jovyan/graphframes-0.7.0-spark2.4-s_2.11.jar --py-files /home/jovyan/graphframes-0.7.0-spark2.4-s_2.11.jar
RUN conda install --quiet --yes 'conda-forge::neo4j-python-driver' && \
pip install graphframes && \
fix-permissions $CONDA_DIR && \
fix-permissions /home/$NB_USER
COPY graphframes-0.7.0-spark2.4-s_2.11.jar /home/$NB_USER/graphframes-0.7.0-spark2.4-s_2.11.jar
But nothing worked and I still had the same error message :(
I was pretty stuck at this point, and returned to Google, where I found a StackOverflow thread that I hadn’t spotted before.
Gilles Essoki suggested copying the GraphFrames JAR directly into the /usr/local/spark/jars directory, so I updated my Dockerfile to do this:
ARG BASE_CONTAINER=jupyter/pyspark-notebook
FROM $BASE_CONTAINER
LABEL maintainer="Mark Needham"
USER root
ENV SPARK_OPTS --driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info
RUN conda install --quiet --yes 'conda-forge::neo4j-python-driver' && \
pip install graphframes && \
fix-permissions $CONDA_DIR && \
fix-permissions /home/$NB_USER
COPY graphframes-0.7.0-spark2.4-s_2.11.jar /usr/local/spark/jars
I built it again, and this time my CSV files were happily processed! So thank you, Gilles!
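A quick way to confirm that the COPY step put the JAR where Spark looks for it is to list the matching files from inside the container. This is a minimal sketch: the helper name is hypothetical, and the default path is the /usr/local/spark/jars directory used in the Dockerfile above.

```python
from pathlib import Path


def find_graphframes_jars(jars_dir="/usr/local/spark/jars"):
    """Return the sorted names of GraphFrames JARs found directly in `jars_dir`."""
    directory = Path(jars_dir)
    if not directory.is_dir():
        # Directory missing, e.g. when run outside the container
        return []
    return sorted(p.name for p in directory.glob("graphframes-*.jar"))


print(find_graphframes_jars())
```

An empty list inside the running container would mean the JAR never made it into the image, which is worth checking before looking at Spark configuration.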
If you want to use this Docker container I’ve put it on GitHub at mneedham/pyspark-graphframes-neo4j-notebook, or you can pull it directly from Docker using the following command:
docker pull markhneedham/pyspark-graphframes-neo4j-notebook
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.