
Apache Spark Integration and Platform Execution for ML - ML 073
Adventures in Machine Learning
00:00
Spark Clustering - Partitioning by Columns?
Internally, a repartition is just done by a hash key. But if you don't specify anything other than the default partition count, which is set at the Spark session level and defaults to 200, you're just going to get 200 partitions across your cluster that contain data. And Spark will attempt to balance them out by estimated size. A minimal sketch of that behavior is shown below.
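The following PySpark sketch illustrates the point being made: repartitioning by a column hashes rows on that column's value, and when no explicit count is given, the number of output partitions falls back to the session-level default of 200 (`spark.sql.shuffle.partitions`). The DataFrame and column names here are illustrative, not from the episode.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()

# Session-level default that column-based repartitioning falls back to.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" unless overridden

df = spark.range(1_000_000)  # illustrative DataFrame with a single `id` column

# Repartition by a column without giving a count: rows are hashed on `id`
# and spread across the session default of 200 partitions.
by_column = df.repartition("id")
print(by_column.rdd.getNumPartitions())  # 200

# Supplying an explicit count overrides the session default.
explicit = df.repartition(50, "id")
print(explicit.rdd.getNumPartitions())  # 50
```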