
Add S3 support to Spark/Spark-Shell

Out of the box, Spark does not come with S3 support. So running something like this in the spark-shell:

scala> spark.read.parquet("s3a://my-bucket/my-data.parquet").printSchema

will yield something like this:

2018-09-05 09:47:59 WARN  FileStreamSink:66 - Error while looking for metadata directory.
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:705)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:388)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)
  ... 49 elided
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  ... 69 more

S3 support is easy to add, though. We need two additional jars: one for the S3A filesystem implementation (hadoop-aws) and one for the AWS S3 client it builds on (aws-java-sdk).

To download the correct versions, first check which Hadoop version your Spark distribution uses by looking into jars/: several hadoop-* jars carry the version number in their file names. For Spark 2.3.1 that is Hadoop 2.7.3.
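
You can also ask Hadoop itself from a running spark-shell; the printed version should match the jars in jars/ (for Spark 2.3.1, expect something like this):

scala> org.apache.hadoop.util.VersionInfo.getVersion
res0: String = 2.7.3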

Next, find the matching version of hadoop-aws on Maven Central (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) and download it. The hadoop-aws page on Maven Central also lists the aws-java-sdk version it depends on; download that as well and put both jars into jars/ in your Spark folder.
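
If you would rather not copy jars into jars/ by hand, Spark can also resolve them from Maven Central at startup via --packages (a sketch, assuming the Hadoop 2.7.3 case from above; adjust the version to match your distribution):

spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3

This pulls hadoop-aws together with its matching aws-java-sdk dependency into the local Ivy cache instead of jars/.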

Configure AWS credentials for Spark (conf/spark-defaults.conf):

spark.hadoop.fs.s3a.access.key YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key YOUR_SECRET_KEY
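
If you prefer not to keep keys in a config file, the same S3A settings can also be applied from within a running spark-shell (key values are placeholders):

scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")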

Trying to access the data on S3 again should work now:

scala> spark.read.parquet("s3a://my-bucket/my-data.parquet").printSchema
root
 |-- date_time: long (nullable = true)
 |-- width: integer (nullable = true)
 |-- height: integer (nullable = true)
 ...