· spark-2

SparkR: Error in invokeJava(isStatic = TRUE, className, methodName, ...) : java.lang.ClassNotFoundException: Failed to load class for data source: csv.

I've been wanting to play around with SparkR for a while and over the weekend deciding to explore a large Land Registry CSV file containing all the sales of properties in the UK over the last 20 years.

First I started up the SparkR shell with the CSV package loaded in:


./spark-1.5.0-bin-hadoop2.6/bin/sparkR --packages com.databricks:spark-csv_2.11:1.2.0

Next I tried to read the CSV file into a Spark data frame by modifying one of the examples from the tutorial:


> sales <- read.df(sqlContext, "pp-complete.csv", "csv")
15/09/20 19:13:02 ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.ClassNotFoundException: Failed to load class for data source: csv.
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:67)
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:87)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
	at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:156)
	at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)
	at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)
	at org.apache.spark.api.r.RBackendH

As far as I can tell I have loaded in the CSV data source so I'm not sure why that doesn't work.

However, I came across this github issue which suggested passing in the full package name as the 3rd argument of 'read.df' rather than just 'csv':


> sales <- read.df(sqlContext, "pp-complete.csv", "com.databricks.spark.csv", header="false")
> sales
DataFrame[C0:string, C1:string, C2:string, C3:string, C4:string, C5:string, C6:string, C7:string, C8:string, C9:string, C10:string, C11:string, C12:string, C13:string, C14:string]

And that worked much better! We can now carry on and do some slicing and dicing of the data to see if there are any interesting insights.

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket