Posts tagged with
Spark Optimisation Tips
We’ll explore how to make Apache Spark jobs run efficiently using real-world PySpark examples in Google Colab. From caching and partitioning to Adaptive Query Execution, learn the practical tricks every data engineer needs to master Spark performance.
Real-time streaming using Kafka, Schema Registry and Spark Glue ETL for Avro records
In this article, I explore a robust and scalable solution for real-time data processing using Amazon Managed Streaming for Apache Kafka (MSK), Confluent Schema Registry, and Apache Spark Streaming within AWS Glue ETL. My main focus is on ensuring compatibility and feasibility for Avro records, a popular data serialization format in the Apache Kafka ecosystem, across cross-platform components that combine AWS and Confluent services. There are plenty of articles on real-time streaming of JSON records using MSK and the Glue Schema Registry (GSR), but I didn't find any that use Confluent Schema Registry with AWS services, hence this article.