Business Intelligence

SQW03 Getting Started with Apache Spark

11/20/2019

8:00am - 9:15am

Level: Introductory to Intermediate

Kevin Feasel

Proprietor

Catallaxy Services, LLC

As companies work to gain insight from ever-increasing amounts of data, data platform practitioners need tools that can scale along with the data. Early big data solutions in the Hadoop ecosystem assumed that data sizes overwhelmed available memory, so they relied heavily on disk to coordinate work between nodes. As the cost of memory decreases and the amount of memory available per server increases, we see a shift in the makeup of big data systems toward heavy memory usage instead of disk. Apache Spark, which focuses on memory-intensive operations, has taken advantage of this hardware shift to become the dominant solution for problems requiring distributed data processing. In this talk, we will take an introductory look at Apache Spark. We will review where it fits in the Hadoop ecosystem, cover how to get started along with some of the basic functional programming concepts needed to understand Spark, and see examples of how we can use Spark to solve problems like calculating PageRank and analyzing large data sets.

You will learn:

  • About Apache Spark
  • How to use Spark with Scala as well as Spark SQL
  • Some of the differences between Spark and a relational technology like SQL Server