Featured Posts
Spark Optimisation Tips
We’ll explore how to make Apache Spark jobs run efficiently using real-world PySpark examples in Google Colab. From caching and partitioning to Adaptive Query Execution, learn the practical tricks every data engineer needs to master Spark performance.
Multi-Table Inserts in Snowflake: One Scan, Many Targets
Snowflake’s Multi-Table Insert feature solves this in a clean and elegant way: you write one SELECT, and Snowflake routes the rows into many target tables.
The Architecture of Apache Druid
This week, we will dive deep into one of the most famous real-time OLAP systems: Apache Druid. Have you ever wondered how it works? This post is based on my notes from reading the paper Druid: A Real-time Analytical Data Store.
Zero-Downtime AWS EMR Deployments
Zero-Downtime EMR Deployments: Lessons Learned from Production
Real-time streaming using Kafka, Schema Registry and Spark Glue ETL for Avro records
In this article, I explore a robust and scalable solution for real-time data processing using Amazon Managed Streaming for Apache Kafka (MSK), Confluent Schema Registry, and Apache Spark Streaming within AWS Glue ETL. My main focus is on ensuring compatibility and feasibility with records in Avro, a popular data serialization format in the Apache Kafka ecosystem, across cross-platform components such as AWS and Confluent services. There are plenty of articles on real-time streaming of JSON records using MSK and GSR (Glue Schema Registry), but I couldn't find anything that uses Confluent Schema Registry with AWS services, so I wrote this article to fill that gap.