Java garbage collection confused me for a long time whenever I tried to tune my Spark applications, and unfortunately I’m not a strong Java developer. It felt terrible to stare at the red blocks marking high GC time in the Spark UI with no idea how to fix them. So I spent some time digging into GC and finally learned what GC is and how to analyze GC logs. Today I’m sharing what I learned, so let’s get started.

Read More


This article is about the 2nd-generation Tungsten engine, the core project for optimizing Spark performance. Compared with the 1st generation, the 2nd mainly focuses on optimizing query plans and speeding up query execution, with the pretty aggressive goal of making performance orders of magnitude faster. Let’s take a look!

Read More


This article presents some tips for Apache Spark and is part 2 of the series. All the tips below are based on real problems I ran into. Although they come from specific situations, they should still be a useful reference. I’ve worked hard to learn Apache Spark but can’t know the details of every part of it, so I’d appreciate it if you point out any mistakes in this article.

Read More


Spark SQL is one of the most important components of Apache Spark, and since Spark 2.0 it has become the foundation of Structured Streaming, ML Pipelines, GraphFrames and more. Spark SQL also provides SQL queries and the DataFrame/Dataset API, both of which are optimized by Catalyst. Let’s talk about Catalyst today.

Read More


This article presents the relationship between Spark’s RDD, DataFrame and Dataset, and discusses the advantages and disadvantages of each. RDD has been the fundamental API since the inception of Spark, and the DataFrame/Dataset API has also been pretty popular since Spark 2.0. What are the differences between them, and how do you decide which API to use? Let’s have a quick look.

Read More
