Skip to content

Conversation

@Mantisus
Copy link
Collaborator

@Mantisus Mantisus commented Dec 13, 2024

Description

Fixes HTTP status code configuration behavior:

  • additional_http_error_status_codes: Properly triggers errors and retries when specified status codes are encountered
  • ignore_http_error_status_codes: Correctly treats specified error status codes as successful responses

Issues

@Mantisus
Copy link
Collaborator Author

Mantisus commented Dec 13, 2024

I have already discussed this with @janbuchar. This PR should not introduce any breaking changes. If additional_http_error_status_codes and ignore_http_error_status_codes are not specified, the previous behavior is preserved.

Status codes were previously discussed in PR: #498

However, we have differences with the JS version of crawlee:

Python:

  • 400-499 codes cause errors without retries

JS:

  • 400-499 codes are treated as successful
  • Exceptions:
    • 401, 403, 429 - cause errors with retries
    • 406 - causes error without retries and ignores ignoreHttpErrorStatusCodes parameter

@Mantisus Mantisus self-assigned this Dec 13, 2024
@Mantisus Mantisus marked this pull request as ready for review December 13, 2024 12:43
@Mantisus Mantisus requested review from Pijukatel, janbuchar and vdusek and removed request for vdusek December 13, 2024 13:21
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions, otherwise LGTM... much better!

Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one last naming thing, otherwise LGTM.

@vdusek
Copy link
Collaborator

vdusek commented Dec 16, 2024

However, we have differences with the JS version of crawlee:

Python:

400-499 codes cause errors without retries
JS:

400-499 codes are treated as successful
Exceptions:
401, 403, 429 - cause errors with retries
406 - causes error without retries and ignores ignoreHttpErrorStatusCodes parameter

Any ideas how we want to align it? I believe we do not have issue for that, correct? cc @janbuchar

@Mantisus Mantisus requested a review from janbuchar December 17, 2024 02:15
@janbuchar
Copy link
Collaborator

janbuchar commented Dec 17, 2024

However, we have differences with the JS version of crawlee:
Python:
400-499 codes cause errors without retries
JS:
400-499 codes are treated as successful
Exceptions:
401, 403, 429 - cause errors with retries
406 - causes error without retries and ignores ignoreHttpErrorStatusCodes parameter

Any ideas how we want to align it? I believe we do not have issue for that, correct? cc @janbuchar

I believe apify/crawlee#812 tracks this. I also think that what we do in this PR is the correct way to handle things. The "success on 4xx" in JS is a crutch that prevented retries on such errors, and we can do that more cleanly now (in both implementations).

How does the Python version handle 406 again? Can it be ignored via ignoreHttpErrorStatusCodes?

@Mantisus Mantisus requested a review from janbuchar December 17, 2024 13:47
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing - it looks like there are no tests for this behavior - could you add some? test_http_crawler would be an appropriate place I guess.

@janbuchar janbuchar self-requested a review December 18, 2024 16:41
@janbuchar janbuchar merged commit 8daf4bd into apify:master Dec 18, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unable to configure retry behavior for status codes 400-499 Refactor handling of ignored_http_status_codes and SessionError

3 participants