Pyspark Linting Rules #7272
I'm generally open to adding package-specific rule sets for extremely popular packages (as with Pandas, NumPy, etc.), and Spark would fit that description. However, it'd be nice to have a few rules lined up before we move forward and add any one of them. Otherwise, we run the risk that we end up with really sparse categories that only contain a rule or two.
Super. I've updated the issue with a couple of rules that we can track. I'll kick off with SPK001-3.
Hi, I was looking for such a thread. To add to the proposed list, here are some rules we wish we had at my company:
Just to add that I would be interested in this functionality. Also, the first link in the original post is broken, and the pylint extension looks unmaintained?
Did you get going with this? Thinking about jumping on it.
Apache Spark is widely used in the Python ecosystem for distributed computing. As a user of Spark, I would like ruff to lint problematic behaviours. The automation that ruff offers is especially useful in projects with mixed levels of software engineering experience, e.g. where people have more of a statistics background.
A PySpark style guide and a pylint extension already exist.
I would like to start by contributing a rule that checks for repeated use of `withColumn`; a sketch of the pattern and its suggested rewrite follows below. This violation seems common in existing code bases.
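For illustration, here is a minimal sketch of the pattern such a rule would flag, together with the rewrites it could suggest. The DataFrame and column names are invented for the example:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Invented example data; the rule would apply to any DataFrame.
df = spark.createDataFrame([(10.0, 2, 4.0)], ["price", "quantity", "cost"])

# Flagged: each withColumn call adds another projection to the logical
# plan, and the PySpark docs advise against calling it repeatedly.
df_bad = (
    df.withColumn("total", F.col("price") * F.col("quantity"))
      .withColumn("unit_margin", F.col("price") - F.col("cost"))
)

# Preferred: a single withColumns call (available since PySpark 3.3)...
df_good = df.withColumns(
    {
        "total": F.col("price") * F.col("quantity"),
        "unit_margin": F.col("price") - F.col("cost"),
    }
)

# ...or one explicit select, as the style guide suggests.
df_good = df.select(
    "*",
    (F.col("price") * F.col("quantity")).alias("total"),
    (F.col("price") - F.col("cost")).alias("unit_margin"),
)
```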
Are you ok with a PR introducing "Spark-specific rules" (e.g. `SPK`)? Proposed rules (several are sketched in code after this list):

- Repeated `withColumn` usage, use `withColumns` or `select`
- Repeated `withColumnRenamed` usage, use `withColumnsRenamed`
- Repeated `drop` usage, consolidate in a single call
- `F.date_format` with a simple argument, replace with a specialised function (e.g. `F.hour`)
- Explicit column selection (e.g. `F.lower(df.col)`), use implicit column selection (e.g. `F.lower(F.col("col"))`)
- Redundant `F.col` usage (e.g. `F.lower(F.col('my_column'))`), use `F.lower('my_column')`.
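To make the remaining proposals concrete, here is a hedged before/after sketch. The data and column names are invented, and since none of these rules exist yet, the flagged/preferred pairings are only my reading of the list above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Ada", "2024-01-01 13:30:00", 1, 2)],
    ["name", "created_at", "a", "b"],
)

# Repeated withColumnRenamed -- flagged:
df.withColumnRenamed("a", "x").withColumnRenamed("b", "y")
# Single withColumnsRenamed call (PySpark >= 3.4):
df.withColumnsRenamed({"a": "x", "b": "y"})

# Repeated drop -- flagged:
df.drop("a").drop("b")
# Consolidated into one call:
df.drop("a", "b")

# F.date_format with a "simple" format string -- flagged
# (note the fix changes the result type from string to integer):
df.select(F.date_format("created_at", "HH"))
df.select(F.hour("created_at"))

# Explicit, DataFrame-bound column access -- flagged:
df.select(F.lower(df.name))
# Implicit column selection via F.col:
df.select(F.lower(F.col("name")))
# And where F.col itself is redundant, the bare name works, since most
# pyspark.sql.functions accept a column name string:
df.select(F.lower("name"))
```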
ruff includes rules that are specific to third-party libraries: NumPy, pandas and Airflow. Spark support would be a nice addition.

I would like to close with the following thought: supporting third-party packages may at first seem like effort spent in the long tail of possible rules to add to ruff. Why not focus only on rules that affect all Python users? I hope that adding these will lead to helper functions that make adding new rules easier. I also think that these libraries will converge on similar API design patterns that can be linted across the ecosystem. As an example, call chaining is common for many packages that perform transformations, as sketched below.
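As a hypothetical illustration of that shared shape (invented data, not from the original post), the same chained-transformation pattern in pandas and PySpark; a rule that understands "repeated calls in a chain" in one library has an obvious analogue in the other:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# pandas: transformations chained on a DataFrame.
pdf = pd.DataFrame({"price": [1.0, 2.0], "quantity": [3, 0]})
out_pd = (
    pdf.assign(total=lambda d: d["price"] * d["quantity"])
       .query("total > 0")
)

# PySpark: the same chained shape, different API.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
out_spark = (
    sdf.withColumn("total", F.col("price") * F.col("quantity"))
       .filter(F.col("total") > 0)
)
```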