CCMRG-1914: Fix infinite retry loop bug with visibility timeout re-deliveries #598

drazisil-codecov · 2025-12-04T22:20:09Z

Problem

Tasks were getting stuck in infinite retry loops at retry 5 and never exceeding max retries. This happened when:

Task reaches retry 5 (request.retries = 5)
Visibility timeout expires before task completes
Redis re-queues the task
Celery treats it as a new delivery (not a retry)
request.retries doesn't increment
Task stays at retry 5 and loops forever

Additionally, bundle analysis processor tasks were completing successfully but uploads weren't being marked as complete, causing infinite redispatches. This occurred because:

update_upload() only called db_session.flush() but never db_session.commit()
If visibility timeout expired before wrap_up_dbsession() ran, the transaction was lost
Upload state changes were never persisted to the database
Uploads remained in "pending" state and kept getting redispatched

Solution

Track total attempts in task headers (including re-deliveries from visibility timeout expiration)
Check both retries and total attempts before retrying to prevent infinite loops
Enhanced LockManager to properly log errors and send Sentry exceptions when max retries are hit
Add explicit db_session.commit() in bundle analysis processor to persist upload state immediately

Changes

Core Fixes

Added _get_total_attempts() method to track total attempts including re-deliveries
Updated safe_retry() to check both retry count and total attempts
Initialize total_attempts=1 in task headers when tasks are created
Increment total_attempts in headers when retrying
Updated upload_finisher.py to use new retry logic

Bundle Analysis Processor Fix

Added explicit db_session.commit() after update_upload() to persist upload state
Added explicit db_session.commit() in error handler to persist error state
Fixed finally block to safely handle cases where result might not be defined

LockManager Enhancements

Added Sentry exception reporting when max retries exceeded
Enhanced error logging with full context (repoid, commitid, lock_name, report_type, etc.)
Added proper error tags for filtering in Sentry

Tests

Fixed all failing tests to account for new total_attempts header
Updated test assertions to match new retry logic

Testing

All existing tests pass
New logic prevents infinite loops by tracking total attempts
LockManager properly reports failures to Sentry
Bundle analysis processor now persists upload state even if task is re-delivered

Fixes CCMRG-1914

… timeout re-deliveries

linear · 2025-12-04T22:20:12Z

CCMRG-1914 Fix infinite retry bug

apps/worker/tasks/base.py

sentry · 2025-12-04T22:27:56Z

Codecov Report

❌ Patch coverage is 77.11864% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.83%. Comparing base (6db2ae8) to head (905730a).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
apps/worker/tasks/bundle_analysis_processor.py	51.85%	13 Missing ⚠️
apps/worker/tasks/upload_finisher.py	20.00%	8 Missing ⚠️
apps/worker/tasks/base.py	90.16%	6 Missing ⚠️

❌ Your patch check has failed because the patch coverage (77.11%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #598      +/-   ##
==========================================
- Coverage   93.87%   93.83%   -0.05%     
==========================================
  Files        1284     1284              
  Lines       46667    46751      +84     
  Branches     1522     1522              
==========================================
+ Hits        43809    43867      +58     
- Misses       2548     2574      +26     
  Partials      310      310

Flag	Coverage Δ
workerintegration	`58.46% <30.50%> (-0.14%)`	⬇️
workerunit	`91.09% <77.11%> (-0.12%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

codecov-eu · 2025-12-04T22:27:56Z

Codecov Report

❌ Patch coverage is 77.11864% with 27 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
apps/worker/tasks/bundle_analysis_processor.py	51.85%	13 Missing ⚠️
apps/worker/tasks/upload_finisher.py	20.00%	8 Missing ⚠️
apps/worker/tasks/base.py	90.16%	6 Missing ⚠️

📢 Thoughts on this report? Let us know!

codecov-notifications · 2025-12-04T22:28:13Z

Codecov Report

❌ Patch coverage is 77.11864% with 27 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
apps/worker/tasks/bundle_analysis_processor.py	51.85%	13 Missing ⚠️
apps/worker/tasks/upload_finisher.py	20.00%	8 Missing ⚠️
apps/worker/tasks/base.py	90.16%	6 Missing ⚠️

📢 Thoughts on this report? Let us know!

…le analysis processor The bundle_analysis_processor was calling update_upload() which only flushed changes to the database session but never committed them. This meant that if the task was re-delivered due to visibility timeout expiration, the uncommitted transaction would be lost, causing the upload state to not persist and leading to infinite redispatches. This fix adds explicit db_session.commit() calls: - After successful upload processing to persist the 'processed' state - After error handling to persist the 'error' state - Also fixes the finally block to safely handle cases where result might not be defined if an exception occurs before processing

… detection Bug 1: Wrap db_session.commit() in try-except to preserve original exception if commit fails. This ensures the original exception is always re-raised, preventing incorrect error handling or retry behavior. Bug 2: Only log re-deliveries when total_attempts > retry_count + 1, not just when total_attempts is not None. This prevents spamming logs with 're-delivery detected' messages for every normal retry and only logs actual visibility timeout re-deliveries.

drazisil-codecov · 2025-12-05T13:00:50Z

@sentry review

apps/worker/tasks/base.py

apps/worker/tasks/upload_finisher.py

apps/worker/services/lock_manager.py

- Add try-except for int() conversion in _get_total_attempts() to handle malformed header values gracefully - Add try-except for int() conversion in re-delivery detection to prevent crashes on invalid header values - Add breadcrumb notification on final failure in upload_finisher when max retries exceeded - Update safe_retry() docstring to clarify that Retry exception propagation is intentional and return True is unreachable by design

apps/worker/tasks/upload_finisher.py

thomasrockhu-codecov

Two main points

(1) there is a lot of code here that needs to be refactored, please review it
(2) the use of +1 everywhere is definitely a code smell, please take a look to see if it's really necessary

apps/worker/services/lock_manager.py

thomasrockhu-codecov · 2025-12-05T14:02:10Z

apps/worker/tasks/base.py

+            Total number of attempts (retries + re-deliveries)
+        """
+        if not hasattr(self, "request"):
+            return 0


I feel like this should return None instead no?

No, because this is an int counter, and I don't think that None can be incremented. I think we want 0 in this case. Unless python hates that.

apps/worker/tasks/base.py

thomasrockhu-codecov · 2025-12-05T14:04:11Z

apps/worker/tasks/base.py

+            # Use header value if it exists (includes re-deliveries)
+            try:
+                return int(total_attempts_header)
+            except (ValueError, TypeError):


this seems... very protective. can we do the check earlier with how total_attempts_header is called?

I wasn't able to identify a place. I think it's only used here. However, we are editing a proporty that isn't always set, and isn't generally modified outside of celery itself.

apps/worker/tasks/base.py

apps/worker/tasks/bundle_analysis_processor.py

apps/worker/services/lock_manager.py

thomasrockhu-codecov · 2025-12-05T15:59:08Z

apps/worker/tasks/tests/unit/test_upload_finisher_task.py

+        task.request.get = lambda key, default=None: {} if key == "headers" else default
+        task.request.headers = {}


what is this doing? seems more complicated than it needs to be for test setup

apps/worker/tasks/base.py

…_attempts'

… tracking

…acking

…etry logic

apps/worker/tasks/bundle_analysis_processor.py

apps/worker/tasks/base.py

apps/worker/services/lock_manager.py

apps/worker/tasks/base.py

thomasrockhu-codecov · 2025-12-05T18:37:08Z

apps/worker/tasks/base.py

            if task and task.request:
                log_context.task_name = task.name
-                log_context.task_id = task.request.id
+                task_id = getattr(task.request, "id", None)


are we sure that task.request.id can be None?

Not positive, better safe then sorry, unless you would like the check removed

apps/worker/tasks/base.py

thomasrockhu-codecov · 2025-12-05T18:38:47Z

apps/worker/tasks/bundle_analysis_processor.py

            upload.state_id = UploadState.ERROR.db_id
            upload.state = "error"
+            try:
+                db_session.commit()


can db_session.commit() not be in the finally block?

We want it to always commit. I think that it isn't in some cases, but where would you prefer it be?

apps/worker/tasks/bundle_analysis_processor.py

… tasks

…r and enhance its initialization with detailed parameters

apps/worker/tasks/bundle_analysis_processor.py

apps/worker/services/lock_manager.py

…ask management

apps/worker/tasks/base.py

apps/worker/tasks/bundle_analysis_processor.py

…s and improve documentation

…nalysis processing

drazisil-codecov · 2025-12-08T14:19:28Z

@sentry review

apps/worker/tasks/base.py

+                    extra={"value": attempts_header, "retry_count": retry_count},
+                )
+            return retry_count + 1
+        return getattr(self.request, "retries", 0) + 1

-        Returns:
-            True if retry was scheduled
-            False if max retries exceeded
+    def _has_exceeded_max_attempts(self, max_retries: int | None) -> bool:
+        """Check if task has exceeded max attempts (including re-deliveries)."""
+        if max_retries is None:
+            return False

-        Example:
-            if some_condition_requires_retry:
-                if not self.safe_retry(max_retries=5, countdown=60):
-                    # Max retries exceeded
-                    log.error("Giving up after too many retries")
-                    return {"success": False, "reason": "max_retries"}
+        max_attempts = max_retries + 1
+        return self.attempts >= max_attempts
+
+    def safe_retry(self, max_retries=None, countdown=None, exc=None, **kwargs):
+        """Safely retry with max retry limit and proper metrics tracking.
+
+        Returns False if max retries exceeded, otherwise raises Retry exception.
+        Unlike self.retry(), this checks max attempts BEFORE retrying and returns
+        False instead of raising MaxRetriesExceededError.


apps/worker/tasks/base.py

+                },
+                tags={"error_type": "max_retries_exceeded", "task": self.name},
+            )
            return False


apps/worker/services/lock_manager.py

+                    },
+                    tags={
+                        "error_type": "lock_max_retries_exceeded",
+                        "lock_name": lock_name,
+                        "lock_type": lock_type.value,
+                    },
+                )
+                # TODO: should we raise this, or would a return be ok?


apps/worker/tasks/bundle_analysis_processor.py

                    previous_result,
                )
        except LockRetry as retry:
-            self.retry(countdown=retry.countdown)
+            if self._has_exceeded_max_attempts(self.max_retries):
+                attempts = self.attempts
+                max_attempts = self.max_retries + 1
+                log.error(
+                    "Bundle analysis processor exceeded max retries",
+                    extra={
+                        "attempts": attempts,
+                        "commitid": commitid,
+                        "max_attempts": max_attempts,
+                        "max_retries": self.max_retries,
+                        "repoid": repoid,
+                    },
+                )
+                return previous_result
+            if not self.safe_retry(
+                max_retries=self.max_retries, countdown=retry.countdown
+            ):
+                attempts = self.attempts
+                log.error(
+                    "Failed to schedule retry for bundle analysis processor",
+                    extra={
+                        "attempts": attempts,
+                        "commitid": commitid,
+                        "repoid": repoid,
+                    },
+                )
+                return previous_result


apps/worker/tasks/bundle_analysis_processor.py

                    "Attempting to retry bundle analysis upload",
                    extra={
+                        "commitid": commitid,
                        "repoid": repoid,
-                        "commit": commitid,
                        "commit_yaml": commit_yaml,
                        "params": params,
                        "result": result.as_dict(),


apps/worker/services/lock_manager.py

@@ -107,25 +143,64 @@
            )


apps/worker/tasks/base.py

+                exc_info=True,
+            )
+        headers = getattr(self.request, "headers", {})
+        return headers or {}
+
+    @property
+    def attempts(self) -> int:
+        """Get attempts including re-deliveries from visibility timeout expiration.
+
+        Returns:
+            - Header value if present and valid (most accurate)
+            - retry_count + 1 if header missing/invalid (best guess based on retry count)
+            - 0 if request unavailable (rare, safe default for comparisons)
+
+        Returns int (not None) to be safe for comparisons and logging without null checks.
        """
-        Safely retry with max retry limit and proper metrics tracking.
+        if not hasattr(self, "request") or self.request is None:
+            return 0


…ith detailed retry information

Enhance task retry logic to track total attempts including visibility…

2c95b20

… timeout re-deliveries

sentry bot reviewed Dec 4, 2025

View reviewed changes

apps/worker/tasks/base.py Outdated Show resolved Hide resolved

drazisil-codecov enabled auto-merge December 5, 2025 02:18

drazisil-codecov requested review from a team and thomasrockhu-codecov December 5, 2025 02:18

drazisil-codecov disabled auto-merge December 5, 2025 12:48

drazisil-codecov added 5 commits December 5, 2025 07:49

fix: Add type cast for compare_sha parameter

4f984e4

style: Apply ruff formatting

8c0850d

style: Fix ruff check issues

48bf4ee

sentry bot reviewed Dec 5, 2025

View reviewed changes

apps/worker/tasks/base.py Outdated Show resolved Hide resolved

sentry bot reviewed Dec 5, 2025

View reviewed changes

apps/worker/tasks/base.py Outdated Show resolved Hide resolved

sentry bot reviewed Dec 5, 2025

View reviewed changes

apps/worker/tasks/base.py Outdated Show resolved Hide resolved

sentry bot reviewed Dec 5, 2025

View reviewed changes

apps/worker/tasks/upload_finisher.py Show resolved Hide resolved

sentry bot reviewed Dec 5, 2025

View reviewed changes

apps/worker/services/lock_manager.py Outdated Show resolved Hide resolved

sentry bot reviewed Dec 5, 2025

View reviewed changes

apps/worker/tasks/upload_finisher.py Outdated Show resolved Hide resolved

thomasrockhu-codecov requested changes Dec 5, 2025

View reviewed changes

feat: Enhance lock management and retry logic

d10f764

sentry bot reviewed Dec 5, 2025

View reviewed changes

apps/worker/tasks/bundle_analysis_processor.py Show resolved Hide resolved

refactor: Improve logging and retry handling in task management

9125115

sentry bot reviewed Dec 5, 2025

View reviewed changes

apps/worker/services/lock_manager.py Show resolved Hide resolved

drazisil-codecov requested a review from thomasrockhu-codecov December 5, 2025 15:40

fix: Ensure safe retrieval of request headers in BaseCodecovTask

6d7d906

thomasrockhu-codecov requested changes Dec 5, 2025

View reviewed changes

drazisil-codecov added 4 commits December 5, 2025 12:15

refactor: Update task retry logic to use 'attempts' instead of 'total…

c8191e0

…_attempts'

refactor: Enhance lock management and retry logic to include attempts…

7dc2a6f

… tracking

refactor: Streamline lock management and retry logic with attempts tr…

22866d8

…acking

refactor: Update lock management to consistently use 'attempts' for r…

9ca255c

…etry logic

drazisil-codecov requested a review from thomasrockhu-codecov December 5, 2025 18:19

sentry bot reviewed Dec 5, 2025

View reviewed changes

apps/worker/tasks/bundle_analysis_processor.py Show resolved Hide resolved

thomasrockhu-codecov requested changes Dec 5, 2025

View reviewed changes

drazisil-codecov added 2 commits December 5, 2025 13:47

refactor: Simplify retry logic in bundle analysis and upload finisher…

7d1588a

… tasks

refactor: Rename LockMaxRetriesExceeded to LockMaxRetriesExceededErro…

8a3b261

…r and enhance its initialization with detailed parameters

sentry bot reviewed Dec 5, 2025

View reviewed changes

apps/worker/tasks/bundle_analysis_processor.py Show resolved Hide resolved

apps/worker/services/lock_manager.py Show resolved Hide resolved

refactor: Improve upload type assignment and enhance retry logic in t…

1094ba4

…ask management

sentry bot reviewed Dec 8, 2025

View reviewed changes

apps/worker/tasks/base.py Outdated Show resolved Hide resolved

drazisil-codecov added 2 commits December 8, 2025 07:57

fix: corrected indent errorr

9cdd941

chore: lint

481e918

sentry bot reviewed Dec 8, 2025

View reviewed changes

apps/worker/tasks/bundle_analysis_processor.py Outdated Show resolved Hide resolved

drazisil-codecov added 3 commits December 8, 2025 08:27

Merge branch 'main' into joe/CCMRG-1914-fix-infinite-retry-bug

95e4342

refactor: Update safe_retry method to raise Retry exception on succes…

c1da000

…s and improve documentation

fix: Return previous results in case of early termination in bundle a…

ea70bd8

…nalysis processing

drazisil-codecov requested a review from thomasrockhu-codecov December 8, 2025 14:19

sentry bot reviewed Dec 8, 2025

View reviewed changes

refactor(lock_manager): Enhance LockMaxRetriesExceededError message w…

905730a

…ith detailed retry information

		task.request.get = lambda key, default=None: {} if key == "headers" else default
		task.request.headers = {}

CCMRG-1914: Fix infinite retry loop bug with visibility timeout re-deliveries #598

Are you sure you want to change the base?

CCMRG-1914: Fix infinite retry loop bug with visibility timeout re-deliveries #598

Uh oh!

Conversation

drazisil-codecov commented Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Changes

Core Fixes

Bundle Analysis Processor Fix

LockManager Enhancements

Tests

Testing

Uh oh!

linear bot commented Dec 4, 2025

Uh oh!

Uh oh!

sentry bot commented Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

codecov-eu bot commented Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

codecov-notifications bot commented Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

drazisil-codecov commented Dec 5, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thomasrockhu-codecov left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

drazisil-codecov commented Dec 4, 2025 •

edited

Loading

sentry bot commented Dec 4, 2025 •

edited

Loading

codecov-eu bot commented Dec 4, 2025 •

edited

Loading

codecov-notifications bot commented Dec 4, 2025 •

edited

Loading