CCMRG-1914: Fix infinite retry loop bug with visibility timeout re-deliveries #598
base: main
Conversation
… timeout re-deliveries
Codecov Report

❌ Your patch check has failed because the patch coverage (77.11%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #598      +/-   ##
==========================================
- Coverage   93.87%   93.83%   -0.05%
==========================================
  Files        1284     1284
  Lines       46667    46751      +84
  Branches     1522     1522
==========================================
+ Hits        43809    43867      +58
- Misses       2548     2574      +26
  Partials      310      310
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
…le analysis processor

The bundle_analysis_processor was calling update_upload(), which only flushed changes to the database session but never committed them. As a result, if the task was re-delivered due to visibility timeout expiration, the uncommitted transaction was lost, the upload state did not persist, and the task was redispatched indefinitely.

This fix adds explicit db_session.commit() calls:
- After successful upload processing, to persist the 'processed' state
- After error handling, to persist the 'error' state

It also fixes the finally block to safely handle cases where result might not be defined because an exception occurred before processing.
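A minimal sketch of the commit placement this commit describes, assuming SQLAlchemy-style session semantics; handle_upload, process_upload, and the upload/log names are illustrative, not the actual worker code:

```python
import logging

log = logging.getLogger(__name__)


def process_upload(db_session, upload):
    """Placeholder for the real bundle analysis processing logic."""
    return {"success": True}


def handle_upload(db_session, upload):
    """Hypothetical shape of the processing path, with explicit commits."""
    result = None  # defined up front so the finally block is safe if processing raises early
    try:
        result = process_upload(db_session, upload)
        upload.state = "processed"
        db_session.commit()  # persist the 'processed' state before the task can be re-delivered
    except Exception:
        upload.state = "error"
        db_session.commit()  # persist the 'error' state so a re-delivery sees it
        raise
    finally:
        log.info(
            "bundle analysis processing finished",
            extra={"has_result": result is not None},
        )
    return result
```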
… detection

Bug 1: Wrap db_session.commit() in try-except to preserve the original exception if the commit fails. This ensures the original exception is always re-raised, preventing incorrect error handling or retry behavior.

Bug 2: Only log re-deliveries when total_attempts > retry_count + 1, not just when total_attempts is not None. This prevents spamming logs with 're-delivery detected' messages for every normal retry and only logs actual visibility timeout re-deliveries.
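Both fixes can be pictured with a short sketch; the names db_session, total_attempts, and retry_count follow the commit message, while the helper functions themselves are hypothetical:

```python
import logging

log = logging.getLogger(__name__)


def commit_error_state(db_session, upload, original_error: Exception):
    """Bug 1: persist the error state without letting a failed commit mask the original exception."""
    upload.state = "error"
    try:
        db_session.commit()
    except Exception:
        # The commit failure is logged, but the original exception is what propagates.
        log.warning("failed to commit error state", exc_info=True)
    raise original_error


def log_if_redelivered(total_attempts, retry_count: int) -> bool:
    """Bug 2: a normal retry yields total_attempts == retry_count + 1, so only
    values beyond that indicate a visibility-timeout re-delivery."""
    if total_attempts is not None and total_attempts > retry_count + 1:
        log.info(
            "re-delivery detected",
            extra={"total_attempts": total_attempts, "retry_count": retry_count},
        )
        return True
    return False
```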
@sentry review
- Add try-except for int() conversion in _get_total_attempts() to handle malformed header values gracefully
- Add try-except for int() conversion in re-delivery detection to prevent crashes on invalid header values
- Add breadcrumb notification on final failure in upload_finisher when max retries exceeded
- Update safe_retry() docstring to clarify that Retry exception propagation is intentional and return True is unreachable by design
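For the header parsing in particular, the guarded conversion might look roughly like this standalone sketch; the 'total_attempts' header name comes from this PR, while the free-function shape is illustrative:

```python
def get_total_attempts(request) -> int:
    """Sketch of the defensive parsing described above, written as a free function."""
    headers = getattr(request, "headers", None) or {}
    raw = headers.get("total_attempts")
    if raw is not None:
        try:
            # Use the header value if it exists; it survives visibility-timeout re-deliveries.
            return int(raw)
        except (ValueError, TypeError):
            pass  # malformed header: fall back to the retry-count estimate
    return getattr(request, "retries", 0) + 1
```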
thomasrockhu-codecov left a comment
Two main points
(1) there is a lot of code here that needs to be refactored, please review it
(2) the use of +1 everywhere is definitely a code smell, please take a look to see if it's really necessary
```python
        Total number of attempts (retries + re-deliveries)
        """
        if not hasattr(self, "request"):
            return 0
```
I feel like this should return None instead no?
No, because this is an int counter, and I don't think that None can be incremented. I think we want 0 in this case. Unless python hates that.
```python
            # Use header value if it exists (includes re-deliveries)
            try:
                return int(total_attempts_header)
            except (ValueError, TypeError):
```
this seems... very protective. can we do the check earlier with how total_attempts_header is called?
I wasn't able to identify a place. I think it's only used here. However, we are editing a property that isn't always set, and isn't generally modified outside of celery itself.
```python
    task.request.get = lambda key, default=None: {} if key == "headers" else default
    task.request.headers = {}
```
what is this doing? seems more complicated than it needs to be for test setup
```python
    if task and task.request:
        log_context.task_name = task.name
        log_context.task_id = task.request.id
        task_id = getattr(task.request, "id", None)
```
are we sure that task.request.id can be None?
Not positive; better safe than sorry, unless you would like the check removed.
```python
        upload.state_id = UploadState.ERROR.db_id
        upload.state = "error"
        try:
            db_session.commit()
```
can db_session.commit() not be in the finally block?
We want it to always commit. I think that it isn't in some cases, but where would you prefer it be?
…r and enhance its initialization with detailed parameters
@sentry review
```python
                extra={"value": attempts_header, "retry_count": retry_count},
            )
            return retry_count + 1
        return getattr(self.request, "retries", 0) + 1

    def _has_exceeded_max_attempts(self, max_retries: int | None) -> bool:
        """Check if task has exceeded max attempts (including re-deliveries)."""
        if max_retries is None:
            return False
        max_attempts = max_retries + 1
        return self.attempts >= max_attempts

    def safe_retry(self, max_retries=None, countdown=None, exc=None, **kwargs):
        """Safely retry with max retry limit and proper metrics tracking.

        Returns False if max retries exceeded, otherwise raises Retry exception.
        Unlike self.retry(), this checks max attempts BEFORE retrying and returns
        False instead of raising MaxRetriesExceededError.

        Returns:
            True if retry was scheduled
            False if max retries exceeded

        Example:
            if some_condition_requires_retry:
                if not self.safe_retry(max_retries=5, countdown=60):
                    # Max retries exceeded
                    log.error("Giving up after too many retries")
                    return {"success": False, "reason": "max_retries"}
```
```python
                },
                tags={"error_type": "max_retries_exceeded", "task": self.name},
            )
            return False
```
```python
                },
                tags={
                    "error_type": "lock_max_retries_exceeded",
                    "lock_name": lock_name,
                    "lock_type": lock_type.value,
                },
            )
            # TODO: should we raise this, or would a return be ok?
```
```python
                previous_result,
            )
        except LockRetry as retry:
            self.retry(countdown=retry.countdown)
            if self._has_exceeded_max_attempts(self.max_retries):
                attempts = self.attempts
                max_attempts = self.max_retries + 1
                log.error(
                    "Bundle analysis processor exceeded max retries",
                    extra={
                        "attempts": attempts,
                        "commitid": commitid,
                        "max_attempts": max_attempts,
                        "max_retries": self.max_retries,
                        "repoid": repoid,
                    },
                )
                return previous_result
            if not self.safe_retry(
                max_retries=self.max_retries, countdown=retry.countdown
            ):
                attempts = self.attempts
                log.error(
                    "Failed to schedule retry for bundle analysis processor",
                    extra={
                        "attempts": attempts,
                        "commitid": commitid,
                        "repoid": repoid,
                    },
                )
                return previous_result
```
| "Attempting to retry bundle analysis upload", | ||
| extra={ | ||
| "commitid": commitid, | ||
| "repoid": repoid, | ||
| "commit": commitid, | ||
| "commit_yaml": commit_yaml, | ||
| "params": params, | ||
| "result": result.as_dict(), |
```python
            exc_info=True,
        )
        headers = getattr(self.request, "headers", {})
        return headers or {}

    @property
    def attempts(self) -> int:
        """Get attempts including re-deliveries from visibility timeout expiration.

        Returns:
            - Header value if present and valid (most accurate)
            - retry_count + 1 if header missing/invalid (best guess based on retry count)
            - 0 if request unavailable (rare, safe default for comparisons)

        Returns int (not None) to be safe for comparisons and logging without null checks.
        """
        if not hasattr(self, "request") or self.request is None:
            return 0
```
…ith detailed retry information
Problem

Tasks were getting stuck in infinite retry loops at retry 5 and never exceeding max retries. This happened when:
- The task reached its configured retry limit (request.retries = 5)
- The message was then re-delivered after the visibility timeout expired, but request.retries doesn't increment on re-deliveries, so the max-retry check never triggered

Additionally, bundle analysis processor tasks were completing successfully but uploads weren't being marked as complete, causing infinite redispatches. This occurred because:
- update_upload() only called db_session.flush() but never db_session.commit()
- If the visibility timeout expired before wrap_up_dbsession() ran, the transaction was lost

Solution

- Track total attempts, including visibility timeout re-deliveries, via a total_attempts task header
- Add explicit db_session.commit() in the bundle analysis processor to persist upload state immediately

Changes

Core Fixes
- Added a _get_total_attempts() method to track total attempts including re-deliveries
- Updated safe_retry() to check both retry count and total attempts
- Set total_attempts=1 in task headers when tasks are created
- Increment total_attempts in headers when retrying (see the sketch after this description)
- Updated upload_finisher.py to use the new retry logic

Bundle Analysis Processor Fix
- Added db_session.commit() after update_upload() to persist upload state
- Added db_session.commit() in the error handler to persist the error state
- Fixed the finally block to safely handle cases where result might not be defined

LockManager Enhancements

Tests
- Added tests covering the total_attempts header

Testing

Fixes CCMRG-1914
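Taken together, the intended flow looks roughly like the sketch below. It assumes a Celery bind=True task and that message headers passed through retry()/apply_async surface on self.request.headers; apart from the total_attempts header name, every name here (example_task, process, TransientError) is illustrative rather than taken from this PR.

```python
from celery import Celery

app = Celery("worker")


class TransientError(Exception):
    """Hypothetical transient failure used only for this sketch."""


def process(payload):
    """Hypothetical unit of work."""
    return {"success": True, "payload": payload}


@app.task(bind=True, max_retries=5)
def example_task(self, payload):
    # Read the attempt counter from the message headers; unlike request.retries,
    # it also moves when a visibility-timeout re-delivery occurs.
    headers = getattr(self.request, "headers", None) or {}
    total_attempts = int(headers.get("total_attempts", 1))
    if total_attempts > self.max_retries + 1:
        # A re-delivery pushed us past the retry budget, so give up
        # instead of redispatching forever.
        return {"success": False, "reason": "max_retries"}
    try:
        return process(payload)
    except TransientError as exc:
        # Carry the incremented counter in the message headers so the next
        # delivery (retry or re-delivery) sees the true attempt count.
        raise self.retry(
            exc=exc, countdown=60, headers={"total_attempts": total_attempts + 1}
        )
```

The key design point is that the attempt check runs before any retry is scheduled, which is what lets a re-delivered message bail out even though request.retries never moves past 5.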