@gaogaotiantian
Contributor

What changes were proposed in this pull request?

Add a pre-commit YAML file so that users can optionally enable a pre-commit hook.

Why are the changes needed?

It's common for users to forget to check lint/format and then wait 2 hours before finding out. This hook prevents that situation. The Python formatter and linter also run in under a second, so the hook won't slow users down much.

This is also optional: users won't notice any change unless they explicitly install pre-commit.
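
For readers unfamiliar with pre-commit, the effect of the hook can be sketched roughly as below. This is an illustration of the behavior (run each configured command, block the commit if any exits non-zero), not how pre-commit itself is implemented. `run_hooks` is a hypothetical helper; only the `./dev/lint-python --ruff` command is taken from this PR's diff.

```python
import subprocess
import sys
from typing import Sequence


def run_hooks(hooks: Sequence[Sequence[str]]) -> int:
    """Run every hook command; return 0 only if all of them exit with 0.

    All hooks are run even when an earlier one fails, so the user sees
    every failure in a single pass.
    """
    failed = False
    for command in hooks:
        result = subprocess.run(list(command))
        if result.returncode != 0:
            print(f"hook failed: {' '.join(command)}", file=sys.stderr)
            failed = True
    return 1 if failed else 0


# Hypothetical usage from the Spark repository root:
#   sys.exit(run_hooks([["./dev/lint-python", "--ruff"]]))
```

A non-zero return value here corresponds to the commit being aborted.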

Does this PR introduce any user-facing change?

No

How was this patch tested?

Local pre-commit worked.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions

JIRA Issue Information

=== Improvement SPARK-55266 ===
Summary: Add pre-commit hook file for python lint/format
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@zhengruifeng
Contributor

I think we should provide a requirements-lint.txt with all versions pinned, so that developers can run the Python linter locally.

@gaogaotiantian
Contributor Author

Developers can already run the linters locally. This PR runs them "automatically" before each commit. requirements.txt already lists all the required tools.

@zhengruifeng
Contributor

zhengruifeng commented Jan 29, 2026

I see. Does dev/requirements exactly match the linter image now? The mypy check always fails in my local environment.

I get this after reinstalling dev/requirements:

starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/ml/linalg/__init__.py:407: error: Unused "type: ignore" comment  [unused-ignore]
python/pyspark/pandas/series.py:1085: error: Unused "type: ignore" comment  [unused-ignore]
python/pyspark/pandas/series.py:1087: error: Unused "type: ignore" comment  [unused-ignore]
python/pyspark/pandas/indexes/base.py:550: error: Unused "type: ignore[arg-type]" comment  [unused-ignore]
python/pyspark/mllib/linalg/__init__.py:460: error: Variable "numpy._typing.ArrayLike" is not valid as a type  [valid-type]
python/pyspark/mllib/linalg/__init__.py:460: note: See https://mypy.readthedocs.io/en/stable/common_issues.html#variables-vs-type-aliases
Found 5 errors in 4 files (checked 1194 source files)
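
To see at a glance which error codes dominate a log like the one above, the error lines can be tallied by their trailing code. This is a minimal sketch, assuming the `path:line: error: message  [code]` line format shown in the log; `count_error_codes` is a hypothetical helper, not part of Spark's dev scripts.

```python
import re
from collections import Counter

# Matches mypy error lines such as:
#   some/file.py:407: error: Unused "type: ignore" comment  [unused-ignore]
# The error code is the last bracketed token on the line.
ERROR_LINE = re.compile(
    r"^(?P<path>[^:]+):(?P<line>\d+): error: .*\[(?P<code>[a-z0-9-]+)\]\s*$"
)


def count_error_codes(output: str) -> Counter:
    """Count mypy errors per error code; 'note:' and summary lines are ignored."""
    counts: Counter = Counter()
    for line in output.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[match.group("code")] += 1
    return counts
```

Applied to the log above, this would separate the `unused-ignore` reports from the single `valid-type` error.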


- id: ruff
  name: Lint Python code with Ruff
  entry: ./dev/lint-python --ruff
Contributor

OK, so we don't run mypy here, which is pretty slow (and flaky in my local environment).
Then LGTM.

Contributor

I think it is slow because mypy checks all the dependencies. Can we configure it not to do that?

Contributor Author

mypy is slow, especially without a cache. It's pure Python. Astral and Meta are each working on a Rust-based type checker that is significantly faster than mypy, but neither is stable enough for production use.
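
On the question of not checking dependencies: mypy has a `--follow-imports` option (`normal`, `silent`, `skip`, `error`) and a `--cache-dir` option. The sketch below only builds a command line using them. It is an assumption that this would help here: how `dev/lint-python` invokes mypy is not shown in this thread, and with `skip` the modules that are not passed explicitly are treated as `Any`, so results can differ from a full check. `build_mypy_command` is a hypothetical helper.

```python
from typing import List, Sequence

# Values accepted by mypy's --follow-imports option.
FOLLOW_IMPORTS_MODES = {"normal", "silent", "skip", "error"}


def build_mypy_command(
    targets: Sequence[str],
    follow_imports: str = "skip",
    cache_dir: str = ".mypy_cache",
) -> List[str]:
    """Build (but do not run) a mypy command that limits import following."""
    if follow_imports not in FOLLOW_IMPORTS_MODES:
        raise ValueError(f"unsupported --follow-imports mode: {follow_imports!r}")
    return [
        "mypy",
        f"--follow-imports={follow_imports}",
        # Reusing a cache directory lets later runs skip unchanged files.
        f"--cache-dir={cache_dir}",
        *targets,
    ]
```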

@gaogaotiantian
Contributor Author

gaogaotiantian commented Jan 29, 2026

Yeah, mypy is much slower and less consistent across environments. We only run black + ruff in pre-commit.

Does your mypy run report a lot of errors? I updated some requirements while fixing mypy; did you upgrade to the latest requirements.txt? There should be a few errors, but not many. I'm working on making mypy more stable, including upgrading its version. Older versions give more inconsistent results across library versions.

@gaogaotiantian
Contributor Author

> The mypy check always fail in my local

I believe you are running lint on 3.12. The lint image is 3.11.
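
A quick way to rule out this kind of mismatch before running the linters locally is to compare the interpreter's major.minor version with the lint image's. This is a minimal sketch; `matches_lint_image` is a hypothetical helper, and the `(3, 11)` constant is taken from the statement above and would need updating if the image changes.

```python
import sys
from typing import Optional, Sequence, Tuple

# Assumption: the lint image runs Python 3.11, per the comment above.
LINT_IMAGE_PYTHON: Tuple[int, int] = (3, 11)


def matches_lint_image(
    actual: Optional[Sequence[int]] = None,
    expected: Tuple[int, int] = LINT_IMAGE_PYTHON,
) -> bool:
    """Return True if the major.minor version equals the lint image's version."""
    if actual is None:
        # Default to the interpreter running this script.
        actual = sys.version_info[:2]
    return tuple(actual[:2]) == expected
```

For example, `matches_lint_image((3, 12, 0))` is `False`, which would explain differing mypy results.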

@Yicong-Huang
Contributor

Love this feature!

> The mypy check always fail in my local
>
> I believe you are running lint on 3.12. The lint image is 3.11.

BTW, I'm updating the lint image to 3.12 in #54042.

@zhengruifeng
Contributor

zhengruifeng commented Jan 29, 2026

> The mypy check always fail in my local
>
> I believe you are running lint on 3.12. The lint image is 3.11.

I have been using 3.13 for some time :)

@HyukjinKwon
Member

Merged to master.

4 participants