A Kafka Broker is nothing more and nothing less than a JVM process running on a machine, serving data store and fetch requests from clients, i.e. Producers and Consumers.

The data itself is stored in the directories configured with log.dirs, typically a list of disk mount points on that Broker. You have probably heard that disk I/O is slow, so if Kafka stores its log segments on disk, how is it so performant? Let’s analyze this.

This post covers how traditional data transfer works, what a Zero Copy optimization is, and how Kafka benefits from it when combined with the…

Since Spark executes an application in a distributed fashion, it is impossible to write the result of a job atomically. For example, when you write a DataFrame, the result of the operation is a directory with multiple files in it, one per DataFrame partition (e.g. part-00001-...). These partition files are written by multiple Executors as the result of their partial computations, but what if one of them fails?

To overcome this, Spark has the concept of a commit protocol, a mechanism that knows how to write partial results and how to handle the success or failure of a write operation.
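As a rough illustration of the idea, the sketch below imitates a commit protocol on a local filesystem in plain Python. It is not Spark's implementation: `write_task`, `commit_job`, `abort_job`, `run_job` and the `job-0` identifier are hypothetical names, and the `_temporary` staging directory and `_SUCCESS` marker only mirror the naming commonly seen in Hadoop-style output directories. Each task writes its partition file into a staging directory, and the files are moved to the final location only if every task succeeds.

```python
import os
import shutil
import tempfile


def write_task(output_dir, job_id, partition, rows, fail=False):
    """Write one partition file into the job's staging directory."""
    staging = os.path.join(output_dir, "_temporary", job_id)
    os.makedirs(staging, exist_ok=True)
    path = os.path.join(staging, f"part-{partition:05d}")
    with open(path, "w") as f:
        for row in rows:
            f.write(f"{row}\n")
    if fail:
        # Simulate an executor dying after it has produced partial output.
        raise RuntimeError(f"task for partition {partition} failed")
    return path


def commit_job(output_dir, job_id):
    """Move every staged file to its final location, then mark success."""
    staging = os.path.join(output_dir, "_temporary", job_id)
    for name in sorted(os.listdir(staging)):
        os.replace(os.path.join(staging, name), os.path.join(output_dir, name))
    shutil.rmtree(os.path.join(output_dir, "_temporary"))
    open(os.path.join(output_dir, "_SUCCESS"), "w").close()


def abort_job(output_dir, job_id):
    """Discard all staged output so no partial result stays visible."""
    shutil.rmtree(os.path.join(output_dir, "_temporary"), ignore_errors=True)


def run_job(output_dir, partitions, failing=()):
    """Run all tasks sequentially; commit only if none of them failed."""
    job_id = "job-0"  # hypothetical job identifier
    try:
        for idx, rows in enumerate(partitions):
            write_task(output_dir, job_id, idx, rows, fail=idx in failing)
    except RuntimeError:
        abort_job(output_dir, job_id)
        return False
    commit_job(output_dir, job_id)
    return True
```

Calling `run_job(out, [["a", "b"], ["c"]])` leaves `part-00000`, `part-00001` and `_SUCCESS` in `out`, while a run with a failing task leaves the directory empty. The final renames are still performed file by file, so this sketch only narrows the window in which partial output is visible rather than making the write truly atomic.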

In this…

On the 30th of May, XTech Community promoted a hands-on “Spark Intro & Beyond” meetup. The presentation kicked off with an introduction to Apache Spark, a parallel distributed processing framework that has been one of the most active Big Data projects over the last few years, in terms of both its usage and the contributions made by its open-source community.

Spark’s Functionalities

In comparison to Hadoop MapReduce, the main advantages of Spark are the ease of writing multi-step jobs through its functional programming API and the ability to keep intermediate data in memory.

Andriy Zabolotnyy

Software & Data Engineer @ tb.lx by Daimler Trucks & Buses

