
Pyspark Linting Rules #7272

Open
sbrugman opened this issue Sep 11, 2023 · 5 comments
Labels
plugin Implementing a known but unsupported plugin

Comments

@sbrugman
Contributor

sbrugman commented Sep 11, 2023

Apache Spark is widely used in the Python ecosystem for distributed computing. As a user of Spark, I would like Ruff to lint problematic behaviours. The automation that Ruff offers is especially useful in projects with mixed levels of software engineering experience, e.g. where contributors have more of a statistics background.

There exist a PySpark style guide and a Pylint extension.

I would like to start by contributing a rule that checks for repeated use of withColumn:

This method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select() with multiple columns at once.

This violation seems common in existing code bases.
Are you OK with a PR introducing "Spark-specific rules" (e.g. SPK)?

  • SPK001: repeated withColumn usage, use withColumns or select
  • SPK002: repeated withColumnRenamed usage, use withColumnsRenamed
  • SPK003: repeated drop usage, consolidate in single call
  • SPK004: F.date_format with simple argument, replace with specialised function (e.g. F.hour)
  • SPK005: direct access column selection (e.g. F.lower(df.col)), use implicit column selection (e.g. F.lower(F.col("col")))
  • SPK006: unnecessary F.col (in F.lower(F.col('my_column'))), use F.lower('my_column').
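To make the motivation for SPK001 concrete, here is a minimal sketch (plain Python, no Spark required; `FakeFrame` is a hypothetical stand-in for a DataFrame, not the PySpark API) of how each chained `withColumn` call wraps the plan in one more projection node, while a single `select` with all columns adds only one:

```python
# Hypothetical stand-in for a Spark DataFrame: each transformation
# appends a "Project" node to the logical plan, mimicking how
# withColumn introduces one projection per call.
class FakeFrame:
    def __init__(self, plan=None):
        self.plan = plan or []

    def withColumn(self, name, expr):
        # One new projection node per call -> plans grow linearly,
        # which is what can trigger StackOverflowException in Spark.
        return FakeFrame(self.plan + [("Project", [name])])

    def select(self, *names):
        # All columns handled in a single projection node.
        return FakeFrame(self.plan + [("Project", list(names))])


# Anti-pattern that SPK001 would flag: withColumn in a loop.
df = FakeFrame()
for i in range(100):
    df = df.withColumn(f"c{i}", i)
print(len(df.plan))  # 100 projection nodes

# Preferred: one select (or withColumns) with all columns at once.
df2 = FakeFrame().select(*[f"c{i}" for i in range(100)])
print(len(df2.plan))  # 1 projection node
```

The lint rule would essentially detect consecutive `.withColumn(...)` calls (or a `withColumn` inside a loop) and suggest collapsing them into one `select`/`withColumns`.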

Ruff already includes rules that are specific to third-party libraries: NumPy, pandas, and Airflow. Spark support would be a nice addition.

I would like to close with the following thought: supporting third-party packages may at first seem like effort spent on the long tail of possible rules to add to Ruff. Why not focus only on rules that affect all Python users? I hope that adding these will lead to helper functions that make adding new rules easier. I also think that these libraries will end up with similar API design patterns that can be linted across the ecosystem. As an example, call chaining is common in many packages that perform transformations.

@charliermarsh
Member

charliermarsh commented Sep 12, 2023

I'm generally open to adding package-specific rule sets for extremely popular packages (as with Pandas, NumPy, etc.), and Spark would fit that description. However, it'd be nice to have a few rules lined up before we move forward and add any one of them. Otherwise, we run the risk that we end up with really sparse categories that only contain a rule or two.

@charliermarsh charliermarsh added the plugin Implementing a known but unsupported plugin label Sep 12, 2023
@sbrugman
Contributor Author

sbrugman commented Sep 12, 2023

Super. I've updated the issue with a couple of rules that we can track. I'll kick off with SPK001-3.

@guilhem-dvr

guilhem-dvr commented Feb 9, 2024

Hi, I was looking for such a thread.

To add to the proposed list, here are some rules we wish we had at my company:

  • unnecessary drop followed by a select
  • use unionByName instead of union / unionAll
  • use df.writeTo(...).append() instead of df.write.insertInto(...)
  • use df.writeTo(...).overwritePartitions() instead of df.write.insertInto(..., overwrite=True)
  • replace udf with native spark functions
  • alias pyspark.sql.functions to F -> from pyspark.sql import ..., functions as F, ...
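The `unionByName` suggestion above is worth a quick illustration: `union`/`unionAll` match columns by position, so two frames with the same columns in a different order combine silently but incorrectly. A minimal sketch in plain Python (rows as tuples plus a column list as a stand-in schema; this is an assumption for illustration, not the PySpark API):

```python
# Two "DataFrames" with the same columns in different order.
left_cols = ["id", "amount"]
left_rows = [(1, 10.0)]

right_cols = ["amount", "id"]
right_rows = [(20.0, 2)]

# Positional union (what union()/unionAll() do): columns are matched
# by position, silently mixing id and amount in the second row.
positional = left_rows + right_rows
print(positional)  # [(1, 10.0), (20.0, 2)] -- misaligned

# Name-based union (what unionByName() does): reorder each right row
# to the left schema before appending.
index = [right_cols.index(c) for c in left_cols]
by_name = left_rows + [tuple(row[i] for i in index) for row in right_rows]
print(by_name)  # [(1, 10.0), (2, 20.0)] -- columns line up
```

A lint rule could flag `.union(` / `.unionAll(` calls and suggest `.unionByName(` when schemas might differ in column order.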

@amadeuspzs

Just to add that I would be interested in this functionality.

Also, the first link in the original post is broken, and the pylint extension looks unmaintained?

@stkrzysiak

> Super. I've updated the issue with a couple of rules that we can track. I'll kick off with SPK001-3.

Did you get going with this? I'm thinking about jumping on it.
