Java Garbage Collection Overview

Feb 26 2019 jvm 9 minutes read (About 1402 words)

Java garbage collection confused me for a long time whenever I tried to tune my Spark applications, since unfortunately I’m not a good Java developer. I felt terrible staring at the red blocks representing high GC time in my Spark UI while having no idea how to fix them. So I spent some time digging into GC and finally learned what GC is and how to analyze GC logs. Today I’m sharing what I learned, so let’s move on.

Second Generation Tungsten Engine in Spark 2.x

Nov 14 2018 spark 10 minutes read (About 1455 words)

This article is about the 2nd generation Tungsten engine, the core project for optimizing Spark performance. Compared with the 1st generation, the 2nd one mainly focuses on optimizing query plans and speeding up query execution, with the pretty aggressive goal of making performance orders of magnitude faster. Let’s take a look!

Spark Tips Sum-up Part-2

Oct 13 2018 spark 6 minutes read (About 854 words)

This article presents some Apache Spark tips and is part 2 of the series. All the tips below are based on real problems I ran into. Although they come from specific situations, they should still be a useful reference. I’ve tried hard to learn Apache Spark, but I can’t know the details of every part of it, so I’d appreciate it if you point out any mistakes in this article.

Catalyst Optimization in Spark SQL

Sep 25 2018 spark 8 minutes read (About 1256 words)

Spark SQL is one of the most important components of Apache Spark and, since Spark 2.0, has become the foundation of Structured Streaming, ML Pipelines, GraphFrames and so on. Spark SQL also provides SQL queries and the DataFrame/Dataset API, both of which are optimized by Catalyst. Let’s talk about Catalyst today.

From Spark RDD to DataFrame/Dataset

Sep 22 2018 spark 7 minutes read (About 1075 words)

This article presents the relationship between Spark RDD, DataFrame and Dataset, and discusses the advantages and disadvantages of each. RDD has been the fundamental API since the inception of Spark, and the DataFrame/Dataset API has also been pretty popular since Spark 2.0. What are the differences between them, and how do you decide which API to use? Let’s have a quick look.

