One of the biggest challenges in designing Spark transformations is handling skewed datasets. Depending on your database schema and driver, the data you pull may already be skewed, and you may also need to order it by columns that are inherently skewed. In this article, after a brief explanation of how to identify skew, I dive into mitigation strategies for ordering over skewed windows.
Strategies for skewed Spark datasets: window ordering use case