
Apache Spark Integration and Platform Execution for ML - ML 073
Adventures in Machine Learning
00:00
Spark Clustering - Partitioning by Columns?
Internally, a repartition is just done by a hash key. But if you don't specify anything other than the default partition count, which is set at the Spark session level and defaults to 200, you're just going to get 200 partitions across your cluster that contain data. And Spark will attempt to balance them out by estimated size. A minimal sketch of that behavior is shown below.
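The following PySpark sketch illustrates the point being made: repartitioning by a column hashes rows on that column's value, and when no explicit count is given, the number of output partitions falls back to the session-level default of 200 (`spark.sql.shuffle.partitions`). The DataFrame and column names here are illustrative, not from the episode.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()

# Session-level default that column-based repartitioning falls back to.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" unless overridden

df = spark.range(1_000_000)  # illustrative DataFrame with a single `id` column

# Repartition by a column without giving a count: rows are hashed on `id`
# and spread across the session default of 200 partitions.
by_column = df.repartition("id")
print(by_column.rdd.getNumPartitions())  # 200

# Supplying an explicit count overrides the session default.
explicit = df.repartition(50, "id")
print(explicit.rdd.getNumPartitions())  # 50
```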