Additional tweaks to task retry docs (PrefectHQ#9575)
abrookins committed May 23, 2023
1 parent 9c6adc8 commit 4c56a9a
Showing 1 changed file with 49 additions and 12 deletions: docs/concepts/tasks.md

## Retries

Prefect can automatically retry tasks on failure. In Prefect, a task _fails_ if
its Python function raises an exception.

To enable retries, pass the `retries` and `retry_delay_seconds` parameters to your
task. If the task fails, Prefect will retry it up to `retries` times, waiting
`retry_delay_seconds` seconds between each attempt. If the task still fails on the
final retry, Prefect marks the task run as _failed_.

!!! note "Retries don't create new task runs"
A new task run is not created when a task is retried. A new state is added to the state history of the original task run.
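
For instance, here is a minimal sketch of a task that fails twice and then succeeds (the `flaky_task` name and module-level counter are illustrative, not part of the Prefect API):

```python
from prefect import flow, task

attempts = {"count": 0}

@task(retries=2, retry_delay_seconds=1)
def flaky_task() -> str:
    # Fail on the first two attempts; succeed on the third. Each failure
    # adds a new state to the same task run rather than creating a new run.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError(f"Attempt {attempts['count']} failed")
    return "success"

@flow
def flaky_flow() -> str:
    return flaky_task()
```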


### A real-world example: making an API request

Consider the real-world problem of making an API request. In this example,
we'll use the [`httpx`](https://www.python-httpx.org/) library to make an HTTP
request.

```python hl_lines="4"
import httpx
from prefect import flow, task

@task(retries=2, retry_delay_seconds=5)
def get_data_task(
    url: str = "https://api.brittle-service.com/endpoint"
) -> dict:
    response = httpx.get(url)

    # If the response status code is anything but a 2xx, httpx will raise
    # an exception. This task doesn't handle the exception, so Prefect will
    # catch the exception and will consider the task run failed.
    response.raise_for_status()

    return response.json()

@flow
def get_data_flow():
    get_data_task()
```

In this task, if the brittle API responds with any status code other than a 2xx
(200, 201, and so on), `httpx` raises an exception, the task fails, and Prefect
will retry the task a maximum of two times, waiting five seconds in between
retries.
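
Note that Prefect only retries the task because `raise_for_status()` lets the exception propagate. As a hypothetical counterexample (the `get_data_no_retry` name is illustrative), a task that catches the error and returns a fallback value never fails, so the retry settings never apply:

```python
import httpx
from prefect import task

@task(retries=2, retry_delay_seconds=5)
def get_data_no_retry(
    url: str = "https://api.brittle-service.com/endpoint"
) -> dict:
    response = httpx.get(url)

    if response.is_error:
        # Swallowing the error means this function never raises an exception,
        # so Prefect marks the task run completed and never retries it.
        return {}

    return response.json()
```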

### Custom retry behavior

The `retry_delay_seconds` option also accepts a list of delays for finer-grained control over retry behavior. The following task will wait successively longer intervals of 1, 10, and 100 seconds before each retry attempt:

```python
from prefect import task

@task(retries=3, retry_delay_seconds=[1, 10, 100])
def some_task_with_manual_backoff_retries():
    ...
```
Additionally, you can pass a callable that accepts the number of retries as an argument and returns a list. Prefect includes an [`exponential_backoff`](/api-ref/prefect/tasks/#prefect.tasks.exponential_backoff) utility that will automatically generate a list of retry delays that correspond to an exponential backoff retry strategy. The following task will wait 10, 20, then 40 seconds before each retry:

```python
from prefect import task
from prefect.tasks import exponential_backoff

@task(retries=3, retry_delay_seconds=exponential_backoff(backoff_factor=10))
def some_task_with_exponential_backoff_retries():
    ...
```
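
Because `retry_delay_seconds` accepts any callable of this shape, you can also define your own schedule. Here is a minimal sketch (the `custom_backoff` helper is hypothetical, not part of Prefect) that produces the same 10, 20, and 40 second delays:

```python
from prefect import task

def custom_backoff(retries: int) -> list[int]:
    # Double a base delay of 10 seconds for each retry: 10, 20, 40, ...
    return [10 * 2**attempt for attempt in range(retries)]

@task(retries=3, retry_delay_seconds=custom_backoff)
def some_task_with_custom_backoff_retries():
    ...
```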

#### Advanced topic: adding "jitter"

While using exponential backoff, you may also want to add _jitter_ to the delay
times. Jitter is a random amount of time added to retry periods that helps prevent
"thundering herd" scenarios, in which many tasks all retry at exactly the same
time, potentially overwhelming systems.

The `retry_jitter_factor` option can be used to add variance to the base delay. For example, a retry delay of 10 seconds with a `retry_jitter_factor` of 0.5 will be allowed to delay up to 15 seconds. Large values of `retry_jitter_factor` provide more protection against "thundering herds," while keeping the average retry delay time constant. For example, the following task adds jitter to its exponential backoff so the retry delays will vary up to maximum delays of 20, 40, and 80 seconds, respectively.

```python
from prefect import task
from prefect.tasks import exponential_backoff

@task(
    retries=3,
    retry_delay_seconds=exponential_backoff(backoff_factor=10),
    # A jitter factor of 1 allows each delay to grow up to double its base.
    retry_jitter_factor=1,
)
def some_task_with_exponential_backoff_retries():
    ...
```
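
To make the jitter arithmetic concrete, the following sketch models a delay calculation that keeps the mean constant while allowing delays up to `base * (1 + factor)`. It illustrates the behavior described above and is not Prefect's exact internal formula:

```python
import random

def jittered_delay(base_delay: float, jitter_factor: float) -> float:
    # Sample uniformly from [base * (1 - f), base * (1 + f)]: the average
    # stays at base_delay, while the maximum grows to base * (1 + f).
    return base_delay * random.uniform(1 - jitter_factor, 1 + jitter_factor)

print(jittered_delay(10, 0.5))  # between 5.0 and 15.0 seconds
```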

### Configuring retry behavior globally with settings

You can also set retries and retry delays by using the following global settings. These settings will not override the `retries` or `retry_delay_seconds` that are set in the flow or task decorator.

```
prefect config set PREFECT_FLOW_DEFAULT_RETRIES=2
prefect config set PREFECT_TASK_DEFAULT_RETRIES=2
prefect config set PREFECT_FLOW_DEFAULT_RETRY_DELAY_SECONDS="[1, 10, 100]"
prefect config set PREFECT_TASK_DEFAULT_RETRY_DELAY_SECONDS="[1, 10, 100]"
```
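
After setting these values, you can confirm the active configuration with the Prefect CLI:

```
prefect config view
```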


## Caching

Caching refers to the ability of a task run to reflect a finished state without actually running the code that defines the task. This allows you to efficiently reuse results of tasks that may be expensive to run with every flow run, or reuse cached results if the inputs to a task have not changed.
