Dataverse Chronicles

Posts tagged with

Spark Optimisation Tips

November 24, 2025 • 6 min read

We’ll explore how to make Apache Spark jobs run efficiently using real-world PySpark examples in Google Colab. From caching and partitioning to Adaptive Query Execution, learn the practical tricks every data engineer needs to master Spark performance.

Read Post

Real time streaming using Kafka, schema registry and Spark Glue ETL for Avro records

November 16, 2025 • 12 min read

In this article, I explore a robust and scalable solution for real-time data processing using Amazon Managed Streaming for Apache Kafka (MSK), Confluent Schema Registry, and Apache Spark Streaming within AWS Glue ETL. My main focus is on ensuring compatibility and feasibility for records serialized with Avro, a popular data serialization format in the Apache Kafka ecosystem, when combining cross-platform components from AWS and Confluent. There are plenty of articles on real-time streaming of JSON records using MSK and the Glue Schema Registry (GSR), but I could not find any that use Confluent Schema Registry with AWS services, so I wrote this article to address that gap.

Read Post