Data Engineering & AI/ML Insights

Dataverse Chronicles

Written by Kishore Panda, an AI Engineer sharing insights from years of experience in Data Engineering and AI/ML.

Featured Posts

Spark Optimisation Tips

6 min read

We’ll explore how to make Apache Spark jobs run efficiently using real-world PySpark examples in Google Colab. From caching and partitioning to Adaptive Query Execution, learn the practical tricks every data engineer needs to master Spark performance.

Multi-Table Inserts in Snowflake: One Scan, Many Targets

9 min read

Snowflake’s Multi-Table Insert feature offers a clean way to load several tables from a single source: you write one SELECT, and Snowflake routes the rows into many target tables.

The Architecture of Apache Druid

10 min read

This week, we will dive deep into one of the most famous real-time OLAP systems: Apache Druid. Have you ever wondered how it works? This post is based on my notes from reading the paper "Druid: A Real-time Analytical Data Store".

Zero-Downtime AWS EMR Deployments

4 min read

Zero-Downtime EMR Deployments: Lessons Learned from Production

Real-Time Streaming Using Kafka, Schema Registry, and Spark Glue ETL for Avro Records

12 min read

This article explores a robust, scalable solution for real-time data processing using Amazon Managed Streaming for Apache Kafka (MSK), Confluent Schema Registry, and Apache Spark Streaming within AWS Glue ETL. My main focus is on making Avro records, a popular data serialization format in the Apache Kafka ecosystem, work across a setup that mixes AWS and Confluent services. There are plenty of articles on real-time streaming of JSON records with MSK and the Glue Schema Registry (GSR), but I couldn't find any that use Confluent Schema Registry with AWS services, so I wrote this one to fill that gap.
