
Conversation

@drazisil-codecov (Contributor) commented Dec 4, 2025

Problem

Tasks were getting stuck in infinite retry loops at retry 5 and never hitting the max-retries limit. This happened when:

  1. Task reaches retry 5 (request.retries = 5)
  2. Visibility timeout expires before task completes
  3. Redis re-queues the task
  4. Celery treats it as a new delivery (not a retry)
  5. request.retries doesn't increment
  6. Task stays at retry 5 and loops forever

Additionally, bundle analysis processor tasks were completing successfully but uploads weren't being marked as complete, causing infinite redispatches. This occurred because:

  1. update_upload() only called db_session.flush() but never db_session.commit()
  2. If visibility timeout expired before wrap_up_dbsession() ran, the transaction was lost
  3. Upload state changes were never persisted to the database
  4. Uploads remained in "pending" state and kept getting redispatched
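
The flush-versus-commit distinction in items 1 and 2 above is the crux: flush() only sends pending statements inside the still-open transaction, so the change evaporates if the session is discarded before anything commits. A minimal, self-contained sketch of that failure mode, using a toy model and an in-memory SQLite database rather than the real codebase models:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Upload(Base):  # toy stand-in for the real Upload model
    __tablename__ = "uploads"
    id = Column(Integer, primary_key=True)
    state = Column(String, default="pending")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

upload = Upload(state="pending")
session.add(upload)
session.commit()

# What update_upload() effectively did: flush, but never commit.
upload.state = "processed"
session.flush()

# Simulate the task being re-delivered (or the worker going away) before
# wrap_up_dbsession() ever commits: the open transaction is rolled back.
session.rollback()
print(upload.state)  # "pending" -- the state change was never persisted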

Solution

  • Track total attempts in task headers (including re-deliveries from visibility timeout expiration)
  • Check both retries and total attempts before retrying to prevent infinite loops
  • Enhance the LockManager to log errors and send Sentry exceptions when max retries are hit
  • Add explicit db_session.commit() in bundle analysis processor to persist upload state immediately
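
Roughly, the retry guard now consults both counters before scheduling another attempt. A hedged sketch of the idea (method and header names follow this description; the real code in tasks/base.py may differ in detail):

from celery import Task

class SafeRetryTask(Task):
    def _get_total_attempts(self) -> int:
        """Best-effort total attempts, including re-deliveries."""
        if not hasattr(self, "request") or self.request is None:
            return 0
        headers = getattr(self.request, "headers", None) or {}
        try:
            # The header survives re-deliveries, so prefer it when valid.
            return int(headers.get("total_attempts"))
        except (TypeError, ValueError):
            # Fall back to Celery's counter (the first run has retries == 0).
            return getattr(self.request, "retries", 0) + 1

    def safe_retry(self, max_retries=None, countdown=None, exc=None, **kwargs):
        """Return False when either limit is exhausted instead of looping."""
        retries = getattr(self.request, "retries", 0)
        attempts = self._get_total_attempts()
        if max_retries is not None and (
            retries >= max_retries or attempts >= max_retries + 1
        ):
            return False  # the caller decides how to give up
        raise self.retry(
            countdown=countdown, exc=exc, max_retries=max_retries, **kwargs
        )

The header bookkeeping itself (initializing total_attempts at dispatch and incrementing it on retry) is sketched under Core Fixes below.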

Changes

Core Fixes

  • Added _get_total_attempts() method to track total attempts including re-deliveries
  • Updated safe_retry() to check both retry count and total attempts
  • Initialize total_attempts=1 in task headers when tasks are created
  • Increment total_attempts in headers when retrying
  • Updated upload_finisher.py to use new retry logic
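
A hedged sketch of that header bookkeeping (hook points and helper names are illustrative; in the PR this presumably lives in the base task's dispatch and retry paths):

def init_attempt_headers(headers: dict | None) -> dict:
    """Called when a task is first created: the first delivery is attempt 1."""
    headers = dict(headers or {})
    headers.setdefault("total_attempts", 1)
    return headers

def bump_attempt_headers(headers: dict | None) -> dict:
    """Called when scheduling a retry: carry the incremented count forward so a
    re-delivered copy of the next run still knows how many attempts really happened."""
    headers = dict(headers or {})
    try:
        headers["total_attempts"] = int(headers.get("total_attempts", 0)) + 1
    except (TypeError, ValueError):
        headers["total_attempts"] = 1
    return headers

With Celery these would be attached as message headers, e.g. task.apply_async(args=..., headers=init_attempt_headers(None)); whether custom headers are visible again on self.request depends on the task protocol in use, which is why the fallback in _get_total_attempts stays in place.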

Bundle Analysis Processor Fix

  • Added explicit db_session.commit() after update_upload() to persist upload state
  • Added explicit db_session.commit() in error handler to persist error state
  • Fixed finally block to safely handle cases where result might not be defined
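
A hedged sketch of the commit placement (the surrounding task code is paraphrased, not copied; update_upload and the state values here are stand-ins):

import logging

log = logging.getLogger(__name__)

def update_upload(db_session, upload, state):
    # Stand-in for the real helper: it stages the change but does not commit.
    upload.state = state
    db_session.flush()

def process_and_persist(db_session, upload, process_fn):
    result = None  # defined up front so the finally block is always safe
    try:
        result = process_fn(upload)
        update_upload(db_session, upload, "processed")
        db_session.commit()  # persist right away; don't rely on a later wrap-up step
    except Exception:
        update_upload(db_session, upload, "error")
        try:
            db_session.commit()  # persist the error state as well
        except Exception:
            # If the commit itself fails, log it but let the original
            # exception propagate unchanged.
            log.exception("Failed to commit error state for upload")
        raise
    finally:
        if result is not None:
            log.info("bundle analysis processing finished", extra={"result": result})
    return result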

LockManager Enhancements

  • Added Sentry exception reporting when max retries exceeded
  • Enhanced error logging with full context (repoid, commitid, lock_name, report_type, etc.)
  • Added proper error tags for filtering in Sentry
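
A hedged sketch of what the LockManager failure path might look like (the exception type, field names, and tag keys are illustrative; only sentry_sdk calls that exist are used):

import logging
import sentry_sdk

log = logging.getLogger(__name__)

class LockRetriesExceeded(Exception):  # illustrative exception type
    pass

def report_lock_max_retries(repoid, commitid, lock_name, report_type, retry_num):
    error = LockRetriesExceeded(
        f"Lock {lock_name} still contended after {retry_num} retries"
    )
    log.error(
        "Max retries exceeded while waiting for lock",
        extra={
            "repoid": repoid,
            "commitid": commitid,
            "lock_name": lock_name,
            "report_type": report_type,
            "retry_num": retry_num,
        },
    )
    with sentry_sdk.push_scope() as scope:
        # Tags make these failures filterable in Sentry.
        scope.set_tag("error_type", "lock_max_retries_exceeded")
        scope.set_tag("lock_name", lock_name)
        sentry_sdk.capture_exception(error)
    raise error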

Tests

  • Fixed all failing tests to account for new total_attempts header
  • Updated test assertions to match new retry logic

Testing

  • All existing tests pass
  • New logic prevents infinite loops by tracking total attempts
  • LockManager properly reports failures to Sentry
  • Bundle analysis processor now persists upload state even if task is re-delivered

Fixes CCMRG-1914

sentry bot commented Dec 4, 2025

Codecov Report

❌ Patch coverage is 77.11864% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.83%. Comparing base (6db2ae8) to head (905730a).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
apps/worker/tasks/bundle_analysis_processor.py 51.85% 13 Missing ⚠️
apps/worker/tasks/upload_finisher.py 20.00% 8 Missing ⚠️
apps/worker/tasks/base.py 90.16% 6 Missing ⚠️

❌ Your patch check has failed because the patch coverage (77.11%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #598      +/-   ##
==========================================
- Coverage   93.87%   93.83%   -0.05%     
==========================================
  Files        1284     1284              
  Lines       46667    46751      +84     
  Branches     1522     1522              
==========================================
+ Hits        43809    43867      +58     
- Misses       2548     2574      +26     
  Partials      310      310              
Flag Coverage Δ
workerintegration 58.46% <30.50%> (-0.14%) ⬇️
workerunit 91.09% <77.11%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown.

…le analysis processor

The bundle_analysis_processor was calling update_upload() which only
flushed changes to the database session but never committed them. This
meant that if the task was re-delivered due to visibility timeout
expiration, the uncommitted transaction would be lost, causing the
upload state to not persist and leading to infinite redispatches.

This fix adds explicit db_session.commit() calls:
- After successful upload processing to persist the 'processed' state
- After error handling to persist the 'error' state
- Also fixes the finally block to safely handle cases where result
  might not be defined if an exception occurs before processing
… detection

Bug 1: Wrap db_session.commit() in try-except to preserve original
exception if commit fails. This ensures the original exception is
always re-raised, preventing incorrect error handling or retry behavior.

Bug 2: Only log re-deliveries when total_attempts > retry_count + 1,
not just when total_attempts is not None. This prevents spamming logs
with 're-delivery detected' messages for every normal retry and only
logs actual visibility timeout re-deliveries.
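
For Bug 2, the condition amounts to flagging only the case where the delivery count has run ahead of Celery's own retry counter. A hedged sketch (the helper name and log fields are illustrative):

import logging

log = logging.getLogger(__name__)

def redelivery_detected(headers: dict | None, retry_count: int) -> bool:
    raw = (headers or {}).get("total_attempts")
    try:
        total_attempts = int(raw)
    except (TypeError, ValueError):
        return False  # missing or malformed header: nothing to report
    # A normal retry has total_attempts == retry_count + 1; anything larger
    # means the broker re-delivered the message after a visibility timeout expired.
    if total_attempts > retry_count + 1:
        log.warning(
            "Task re-delivery detected",
            extra={"total_attempts": total_attempts, "retry_count": retry_count},
        )
        return True
    return False
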
@drazisil-codecov (Contributor, Author) commented:
@sentry review

- Add try-except for int() conversion in _get_total_attempts() to handle
  malformed header values gracefully
- Add try-except for int() conversion in re-delivery detection to prevent
  crashes on invalid header values
- Add breadcrumb notification on final failure in upload_finisher when
  max retries exceeded
- Update safe_retry() docstring to clarify that Retry exception propagation
  is intentional and return True is unreachable by design
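
A hedged sketch of the breadcrumb added when upload_finisher finally gives up (the category, message, and data keys are illustrative):

import sentry_sdk

def record_upload_finisher_giveup(repoid: int, commitid: str) -> None:
    # Breadcrumb only; the error log and any exception capture happen elsewhere.
    sentry_sdk.add_breadcrumb(
        category="upload_finisher",
        message="Max retries exceeded; giving up",
        level="error",
        data={"repoid": repoid, "commitid": commitid},
    )
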
@thomasrockhu-codecov (Contributor) left a comment:

Two main points:

(1) There is a lot of code here that needs to be refactored; please review it.
(2) The use of +1 everywhere is definitely a code smell; please take a look to see whether it's really necessary.

Total number of attempts (retries + re-deliveries)
"""
if not hasattr(self, "request"):
    return 0

Contributor:

I feel like this should return None instead, no?

Contributor Author:

No, because this is an int counter, and I don't think that None can be incremented. I think we want 0 in this case. Unless Python hates that.

# Use header value if it exists (includes re-deliveries)
try:
    return int(total_attempts_header)
except (ValueError, TypeError):

Contributor:

This seems... very protective. Can we do the check earlier, where total_attempts_header is read?

Contributor Author:

I wasn't able to identify a place; I think it's only used here. However, we are editing a property that isn't always set and isn't generally modified outside of Celery itself.

Comment on lines +835 to +836
task.request.get = lambda key, default=None: {} if key == "headers" else default
task.request.headers = {}

Contributor:

What is this doing? It seems more complicated than it needs to be for test setup.

if task and task.request:
    log_context.task_name = task.name
    log_context.task_id = task.request.id
    task_id = getattr(task.request, "id", None)

Contributor:

Are we sure that task.request.id can be None?

Contributor Author:

Not positive; better safe than sorry, unless you would like the check removed.

upload.state_id = UploadState.ERROR.db_id
upload.state = "error"
try:
    db_session.commit()

Contributor:

Can db_session.commit() not be in the finally block?

Contributor Author:

We want it to always commit; I think it isn't in some cases. Where would you prefer it be?

@drazisil-codecov (Contributor, Author) commented:

@sentry review
