Data pipelines are commonly used to gather and transform data from different source systems. In Azure, a common way to perform extract, transform, and load (ETL) is with data flows in Azure Data Factory or integration pipelines in Azure Synapse. While this approach may be easy to build, it is not easy to troubleshoot because it has many moving parts, and it can be expensive to run. Conversely, creating all of your ETL in pure Python can be difficult to write and maintain. In this session, you will learn how to use Spark SQL and Python to create notebooks that are called from integration pipelines, giving you an efficient, scalable, and maintainable solution for data migration and transfer tasks. Once you see how easy this code is to create and read, you will want to incorporate this design pattern into your own Azure data development environments.
You will learn:
- How to leverage your existing SQL knowledge in Apache Spark (see the sketch after this list)
- How to create design patterns for moving data while reducing Azure spend
- How to incorporate Spark notebooks to improve pipeline design patterns
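To give a sense of the pattern, here is a minimal sketch of the kind of notebook cell the session builds on: a Spark SQL transformation wrapped in a little Python that a Synapse or Data Factory pipeline could invoke. The storage paths and the `customer_raw` view name are illustrative assumptions, not part of the session material.

```python
# Minimal sketch: a Spark SQL transformation in a Python notebook cell
# that an Azure Synapse or Data Factory pipeline could call.
# Paths and names below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw data the pipeline landed in the data lake (hypothetical path).
raw_df = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/customers/")
raw_df.createOrReplaceTempView("customer_raw")

# Express the transformation in familiar SQL instead of hand-written Python.
clean_df = spark.sql("""
    SELECT customer_id,
           TRIM(UPPER(country_code)) AS country_code,
           CAST(order_total AS DECIMAL(18, 2)) AS order_total
    FROM customer_raw
    WHERE customer_id IS NOT NULL
""")

# Write the curated result where downstream pipeline activities expect it.
clean_df.write.mode("overwrite").parquet(
    "abfss://curated@mydatalake.dfs.core.windows.net/customers/"
)
```

Because the transformation logic lives in plain SQL, anyone who can read a SELECT statement can review and maintain the notebook, while the pipeline only orchestrates when it runs.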