sailfish009/flight-spark-source

Spark source for Flight enabled endpoints

This project uses the new Spark Data Source V2 interface to connect to Apache Arrow Flight endpoints. It is a prototype of what is possible with Arrow Flight. The prototype has achieved a 50x speedup over a serial JDBC driver, and it scales with the number of Flight endpoints and Spark executors running in parallel.
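The scaling behaviour comes from treating each Flight endpoint as an independent unit of work. A minimal sketch of that idea follows, assuming one Spark partition per endpoint. The `Endpoint` and `Partition` classes are hypothetical stand-ins written for illustration; they are not the Arrow Flight or Spark types, and this is not the project's actual planning code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these types are illustrative stand-ins, not Arrow or Spark classes.
final class PartitionPlanSketch {

    // Stand-in for a Flight endpoint: where to connect and which ticket to redeem there.
    static final class Endpoint {
        final String uri;
        final String ticket;

        Endpoint(String uri, String ticket) {
            this.uri = uri;
            this.ticket = ticket;
        }
    }

    // Stand-in for a Spark input partition that reads exactly one endpoint.
    static final class Partition {
        final int index;
        final Endpoint endpoint;

        Partition(int index, Endpoint endpoint) {
            this.index = index;
            this.endpoint = endpoint;
        }
    }

    // One partition per endpoint, so the available parallelism equals the endpoint count.
    static List<Partition> plan(List<Endpoint> endpoints) {
        List<Partition> partitions = new ArrayList<>();
        for (int i = 0; i < endpoints.size(); i++) {
            partitions.add(new Partition(i, endpoints.get(i)));
        }
        return partitions;
    }

    public static void main(String[] args) {
        // The URIs and tickets below are made-up example values.
        List<Endpoint> endpoints = Arrays.asList(
                new Endpoint("grpc://node-a:47470", "ticket-0"),
                new Endpoint("grpc://node-b:47470", "ticket-1"));
        for (Partition p : plan(endpoints)) {
            System.out.println(p.index + " -> " + p.endpoint.uri);
        }
    }
}
```

Because the partitions share no state, the scheduler can run them on separate executors, which is consistent with throughput growing as endpoints and executors are added.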

It currently supports:

  • Columnar batch reading
  • Reading many Flight endpoints in parallel as Spark partitions
  • Filter and projection pushdown
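To illustrate what filter and projection pushdown means here, the sketch below turns a column list and a set of simple filters into a single SQL query, assuming the endpoint accepts a SQL string (as a SQL engine behind Flight would). `SimpleFilter`, `buildQuery`, and `unsupported` are hypothetical names invented for this example and are not part of Spark or of this repository. Filters that the sketch cannot translate are returned to the caller, mirroring the Data Source V2 convention that unpushed filters are evaluated by Spark after the scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types come from Spark or from this repository.
final class PushdownSketch {

    // Minimal stand-in for a filter: column, operator, literal value.
    static final class SimpleFilter {
        final String column;
        final String op;
        final Object value;

        SimpleFilter(String column, String op, Object value) {
            this.column = column;
            this.op = op;
            this.value = value;
        }
    }

    // Operators this sketch knows how to translate; everything else stays with Spark.
    private static final List<String> SUPPORTED_OPS = Arrays.asList("=", "<", ">", "<=", ">=");

    static boolean canPush(SimpleFilter f) {
        return SUPPORTED_OPS.contains(f.op);
    }

    // Numbers are emitted as-is; other values become quoted strings with ' escaped.
    static String toSql(SimpleFilter f) {
        String literal = (f.value instanceof Number)
                ? f.value.toString()
                : "'" + f.value.toString().replace("'", "''") + "'";
        return f.column + " " + f.op + " " + literal;
    }

    // Filters that could not be pushed and must be re-applied after the scan.
    static List<SimpleFilter> unsupported(List<SimpleFilter> filters) {
        List<SimpleFilter> rest = new ArrayList<>();
        for (SimpleFilter f : filters) {
            if (!canPush(f)) {
                rest.add(f);
            }
        }
        return rest;
    }

    // Projection becomes the SELECT list; pushable filters become the WHERE clause.
    static String buildQuery(String table, List<String> columns, List<SimpleFilter> filters) {
        String select = columns.isEmpty() ? "*" : String.join(", ", columns);
        StringBuilder sql = new StringBuilder("SELECT ").append(select).append(" FROM ").append(table);
        List<String> clauses = new ArrayList<>();
        for (SimpleFilter f : filters) {
            if (canPush(f)) {
                clauses.add(toSql(f));
            }
        }
        if (!clauses.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", clauses));
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        List<SimpleFilter> filters = Arrays.asList(
                new SimpleFilter("age", ">", 30),
                new SimpleFilter("name", "LIKE", "A%")); // not pushable in this sketch
        // prints: SELECT name, age FROM people WHERE age > 30
        System.out.println(buildQuery("people", Arrays.asList("name", "age"), filters));
        // prints: 1
        System.out.println(unsupported(filters).size());
    }
}
```

Pushing the projection and filters into the query means only the needed columns and rows cross the wire, which matters most when the data arrives as large columnar batches.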

It currently lacks:

  • Support for all Spark/Arrow data types and filters
  • A generic Flight abstraction: the code is strongly tied to Dremio's Flight endpoint and should be abstracted to generic Flight sources
  • Support for the new features in Arrow 0.15.0
  • A write interface that uses DoPut to write Spark DataFrames back to an Arrow Flight endpoint
  • Use of the transactional capabilities of the Spark Source V2 interface
  • A proper benchmark test
  • CI build and tests
