diff --git a/pkg/scheduler/DESIGN.md b/pkg/scheduler/DESIGN.md new file mode 100644 index 00000000000..7d9c81781d3 --- /dev/null +++ b/pkg/scheduler/DESIGN.md @@ -0,0 +1,588 @@ +# Query Request Queue Design: Queue Splitting and Prioritization

The `RequestQueue` subservice embedded into the scheduler process is responsible for
all decisions regarding enqueuing and dequeuing of query requests.
While the `RequestQueue`'s responsibilities are relatively broad, including management of
querier-worker connection lifecycles and graceful startup/shutdown,
the queuing logic is isolated into a "tree queue" structure and its associated queue algorithms.

## Tree Queue: What and Why

The "tree queue" structure implements a discrete priority queue.
Requests are split into many queues; each request queue is located at a leaf node in the tree structure.

The prioritization decisions required for the request queue carry constraints which lend themselves to a
search tree or decision tree structure:
- Each decision is taken independently of the others, based on its own state
- The decisions are hierarchical, or ordered; the same set of request queues may produce a different final decision
  (request queue) depending on the order in which the decisions are executed
- The second decision may return no valid queues; in this case, we defer back to the first decision for a different result
- If there are any requests waiting to be served, and valid queriers available to serve those requests,
  then these two decisions combined _must_ eventually produce a request queue.

The ordered nature of the decision-making can be expressed as tree levels,
and the end goal of producing a request queue can be modeled as a depth-first search.

### Diagram: Dequeue Decision Tree (Simplified)

> [!NOTE]
> The system maintains a fourth query component, `unknown`, which is treated the same as `ingester-and-store-gateway`.
+For diagrams in this doc, we omit the `unknown` query component node and its subtree for visual simplicity. + + +```mermaid +--- +title: Dequeue Decision Tree (Simplified) +config: + theme: default +--- + +%%{init: {"flowchart": {"htmlLabels": false}} }%% + +graph TB + queryComponentAlgo["` + select query component node by querier-worker ID + `"] + style queryComponentAlgo fill:white,stroke:lightgray,stroke-dasharray:5 + + + root(["`**root** + + `"]) + + root~~~tenandShardAlgo[" + select tenant by global tenant rotation + "] + + style tenandShardAlgo fill:white,stroke:lightgray,stroke-dasharray:5 + + ingester(["`**ingester**`"]) + storeGateway(["`**store-gateway**`"]) + + both(["`**ingester + + + store-gateway**`"]) + + root<-->ingester + root<-->storeGateway + root<-->both + + ingester-tenant1(["`**tenant-1** + + [queue node] + `"]) + ingester-tenant2(["`**tenant-2** + + [queue node] + `"]) + + storeGateway-tenant1(["`**tenant-1** + + [queue node] + `"]) + storeGateway-tenant2(["`**tenant-2** + + [queue node] + `"]) + + both-tenant1(["`**tenant-1** + + [queue node] + `"]) + both-tenant2(["`**tenant-2** + + [queue node] + `"]) + + ingester<-->ingester-tenant1 + ingester<-->ingester-tenant2 + + + storeGateway<-->storeGateway-tenant1 + storeGateway<-->storeGateway-tenant2 + + both<-->both-tenant1 + both<-->both-tenant2 + +``` + +### Enqueuing to the Tree Queue + +On enqueue, we partition requests into separate simple queues based on two static properties of the query request: + +- the "expected query component" + - `ingester`, `store-gateway`, `ingester-and-store-gateway`, or `unknown` +- the tenant ID of the request + +These properties are used to place the request into a queue at a leaf node. +A request from `tenant-1` which is expected to only utilize ingesters +will be enqueued at the leaf node with path `root -> ingester -> tenant-1`. 
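The enqueue-side partitioning can be sketched in Go. The names below are illustrative only, not the actual `pkg/scheduler/queue` API: the key point is that the leaf-node path is derived purely from the two static request properties.

```go
package main

import "fmt"

// QueuePath is an ordered list of node names from just below the root
// down to the leaf, mirroring the two tree levels described above.
type QueuePath []string

// normalizeQueryComponent maps any unrecognized expected query component
// to "unknown", yielding one of the four query-component nodes.
func normalizeQueryComponent(component string) string {
	switch component {
	case "ingester", "store-gateway", "ingester-and-store-gateway":
		return component
	}
	return "unknown"
}

// enqueuePath derives the leaf-node path for a request from its two
// static properties: expected query component and tenant ID.
func enqueuePath(expectedQueryComponent, tenantID string) QueuePath {
	return QueuePath{normalizeQueryComponent(expectedQueryComponent), tenantID}
}

func main() {
	// A tenant-1 request expected to touch only ingesters lands at
	// root -> ingester -> tenant-1.
	fmt.Println(enqueuePath("ingester", "tenant-1"))
}
```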
+ +### Dequeuing from the Tree Queue + +On dequeue, we perform a depth-first search of the tree to select a leaf node to dequeue from. +Each of the two non-leaf levels of the tree uses a different algorithm to select the next child node. + +1. At the root node level, one algorithm selects one of four possible query component child nodes. +2. At query component level, the other algorithm attempts to select a tenant-specific child node. + 1. due to tenant-querier shuffle sharding, it is possible that none of the tenant nodes + can be selected for dequeuing for the current querier. +3. If a tenant node is selected, the search dequeues from it, as it has reached a leaf node. +4. If no tenant node is selected, the search returns back up to the root node level + and selects the next query component child node to continue the search from. + +### Diagram: Dequeue Decision Tree (Full) + +```mermaid +--- +title: Dequeue Decision Tree (Full) +config: + theme: default +--- + +%%{init: {"flowchart": {"htmlLabels": false}} }%% + +graph TB + + request((("` + **dequeue request** + querierID + workerID + lastTenantIndex + `"))) + + subgraph rootLayer ["**root layer**"] + queryComponentAlgo["` + **query component + node selection algorithm:** + select starting query component by querier-worker ID; + move on to next query component node only if DFS exhausted + `"] + style queryComponentAlgo fill:white,stroke:lightgray,stroke-dasharray:5 + + root(["**root**"]) + end + request-->root + + subgraph queryComponentLayer ["**query component layer**"] + tenandShardAlgo["` + **tenant shuffle shard + node selection algorithm:** + select next tenant in global rotation; + start after lastTenantIndex, + move on to next tenant node if tenant not sharded to querier + `"] + style tenandShardAlgo fill:white,stroke:lightgray,stroke-dasharray:5 + + ingester(["`**ingester**`"]) + storeGateway(["`**store-gateway**`"]) + both(["`**ingester + store-gateway**`"]) + end + + %% root~~~tqasDescribe + 
root-->|search subtree for next sharded tenant|ingester + ingester-->|no sharded tenant found|root + + root-->|search subtree for next sharded tenant|storeGateway + storeGateway-->|no sharded tenant found|root + + root-->|search subtree for next sharded tenant|both + both-->|no sharded tenant found|root + + + subgraph tenantLayer ["**tenant layer (leaf)**"] + leafAlgo["` + **dequeue:** + No further node selection at leaf node level; + any existing node has a non-empty queue + `"] + style leafAlgo fill:white,stroke:lightgray,stroke-dasharray:5 + + ingester-tenant1([**tenant-1**]) + ingester-tenant2([**tenant-2**]) + + storeGateway-tenant1([**tenant-1**]) + storeGateway-tenant2([**tenant-2**]) + + both-tenant1([**tenant-1**]) + both-tenant3([**tenant-3**]) + end + + ingester-->|sharded to querierID?|ingester-tenant1 + ingester-tenant1-->|no|ingester + ingester-->|sharded to querierID?|ingester-tenant2 + ingester-tenant2-->|no|ingester + + storeGateway-->|sharded to querierID?|storeGateway-tenant1 + storeGateway-tenant1-->|no|storeGateway + storeGateway-->|sharded to querierID?|storeGateway-tenant2 + storeGateway-tenant2-->|no|storeGateway + + both-->|sharded to querierID?|both-tenant1 + both-tenant1-->|no|both + both-->|sharded to querierID?|both-tenant3 + both-tenant3-->|no|both + + dequeue-ingester-tenant1[[dequeue for ingester, tenant1]] + dequeue-ingester-tenant2[[dequeue for ingester, tenant2]] + ingester-tenant1-->|yes|dequeue-ingester-tenant1 + ingester-tenant2-->|yes|dequeue-ingester-tenant2 + + dequeue-storeGateway-tenant1[[dequeue for store-gateway, tenant1]] + dequeue-storeGateway-tenant2[[dequeue for store-gateway, tenant2]] + storeGateway-tenant1-->|yes|dequeue-storeGateway-tenant1 + storeGateway-tenant2-->|yes|dequeue-storeGateway-tenant2 + + dequeue-both-tenant1[[dequeue for ingester+store-gateway, tenant1]] + dequeue-both-tenant3[[dequeue for ingester+store-gateway, tenant3]] + both-tenant1-->|yes|dequeue-both-tenant1 + 
both-tenant3-->|yes|dequeue-both-tenant3 +``` + +## Deep Dive: Queue Selection Algorithms + +### Context & Requirements + +#### Original State: Queue Splitting by Tenant + +The `RequestQueue` originally only split queues by tenant, with two goals in mind: + +1. tenant fairness via a simple round-robin between all tenants with non-empty query request queues +2. rudimentary tenant isolation via shuffle-shard assignment of noisy tenants to only a subset of queriers + +While this inter-tenant Quality-Of-Service approach has worked well, +other QoS issues have arisen from the varying characteristics of Mimir's two "query components" -- +components that the queriers fetch TSDB data from in order to execute queries: ingesters and store-gateways. + +#### New Requirement: Queue Splitting by Query Component + +Ingesters serve requests for recent data, and store-gateways serve requests for older data. +While queries can span the time periods served by both query components, +many requests are served by only one of the two components. + +Ingesters and store-gateways tend to experience issues independently of each other, +but when one component was in a degraded state, _all_ queries would wait in the queue behind the slow queries, +regardless of which query component they required, +causing high latency and timeouts for queries which could have been serviced by the non-degraded query component. + + +#### New Requirement: Ordered Decision-Making + +Because all requests will now be split across two dimensions instead of one, +it matters which dimension is considered first. Earlier in this project, we optimized for simplicity +by implementing the additional query-component queue splitting as a decision taken after choosing a tenant. +As a result, because any given tenant in the queue was guaranteed to have some request(s) queued, +we always dequeued a request for the next tenant in the queue, +even if that tenant only had requests queued for a degraded component. 
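The failure mode of the tenant-first ordering can be sketched with hypothetical types (not the real queue implementation): once a tenant is chosen, the dequeuer is obligated to take whatever that tenant has queued, even when every one of its requests targets the degraded component.

```go
package main

import "fmt"

// request pairs a tenant with its expected query component.
type request struct {
	tenant, queryComponent string
}

// dequeueTenantFirst models the original ordering: pick the next tenant
// in the rotation with a non-empty queue, then take the front request,
// regardless of which query component it targets.
func dequeueTenantFirst(queues map[string][]request, rotation []string) request {
	for _, tenant := range rotation {
		if q := queues[tenant]; len(q) > 0 {
			return q[0]
		}
	}
	return request{}
}

func main() {
	// tenant-1 has only store-gateway (degraded) requests queued;
	// tenant-2 has an ingester request ready to go.
	queues := map[string][]request{
		"tenant-1": {{"tenant-1", "store-gateway"}},
		"tenant-2": {{"tenant-2", "ingester"}},
	}
	got := dequeueTenantFirst(queues, []string{"tenant-1", "tenant-2"})
	// The waiting ingester request is stuck behind the degraded component.
	fmt.Printf("dequeued a %s request for %s\n", got.queryComponent, got.tenant)
}
```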
+ +Thus, the decision of "which query component to dequeue a request for?" +must come _before_ the decision of which tenant to dequeue a request for. + +#### New Requirement: Query Component Prioritization + +A vanilla round-robin algorithm does not sufficiently guard against a high-latency component +saturating all or nearly all connections with inflight requests for that component. +Despite rotating which query component is dequeued for, utilization of the querier-worker connection pool +as measured by inflight query processing time will grow asymptotically to be dominated by the slow query component. +In some cases, we may reach this state even faster than in the simple (round-robin across tenant queues) case. +This is because every querier has, at worst, a 75% chance of dequeuing a request for a degraded component +while there are still queries available for non-degraded components. + +Therefore, we are required to make some prioritization decisions about query components +to keep dequeuing queries for non-degraded components wherever possible. + +### Query Component Selection to Solve Processing Time Dominance by Slow Queries + +#### Modeling the Problem + +To demonstrate the issue, we can simplify the system to two query components and four querier connections. +Queries to the "slow" query component take 8 ticks to process while queries to the "fast" query component take 1 tick. +The round-robin selection advances from fast to slow or vice versa with each dequeue. 
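This model can be reproduced with a toy discrete-event sketch. It is illustrative only, and assumes a strict global fast/slow alternation with ties broken by connection index; work still in flight at the horizon is counted as truncated busy time.

```go
package main

import "fmt"

// simulateRoundRobin models 4 querier connections dequeuing under a
// strict fast/slow alternation for a 16-tick horizon. Fast queries take
// 1 tick, slow queries take 8.
func simulateRoundRobin() (started, completed, busyTicks map[string]int) {
	const (
		numConsumers = 4
		horizon      = 16
	)
	cost := map[string]int{"fast": 1, "slow": 8}
	started = map[string]int{}
	completed = map[string]int{}
	busyTicks = map[string]int{}

	freeAt := make([]int, numConsumers) // tick at which each consumer is next idle
	nextSlow := false                   // alternation state: fast, slow, fast, slow...
	for {
		// The earliest-free consumer dequeues next; ties broken by index.
		c := 0
		for i := 1; i < numConsumers; i++ {
			if freeAt[i] < freeAt[c] {
				c = i
			}
		}
		if freeAt[c] >= horizon {
			return // nobody frees up before the horizon
		}
		kind := "fast"
		if nextSlow {
			kind = "slow"
		}
		nextSlow = !nextSlow
		started[kind]++
		if end := freeAt[c] + cost[kind]; end <= horizon {
			completed[kind]++
			busyTicks[kind] += cost[kind]
			freeAt[c] = end
		} else {
			busyTicks[kind] += horizon - freeAt[c] // truncated at the horizon
			freeAt[c] = end
		}
	}
}

func main() {
	started, completed, busy := simulateRoundRobin()
	fmt.Println("started:  ", started)   // map[fast:8 slow:8]
	fmt.Println("completed:", completed) // map[fast:8 slow:5]
	fmt.Println("busy:     ", busy)      // map[fast:8 slow:56]
}
```

The printed counts match the tallies listed below.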
+ +In 64 ticks (16 each for 4 connections), the system: + +- dequeues and starts processing 16 queries: 8 fast, 8 slow +- completes processing 13 queries: 8 fast, 5 slow +- spends 8 ticks processing the fast queries +- spends 56 ticks processing the slow queries + +##### Diagram: Query Processing Time Utilization with Round-Robin + +```mermaid +--- +title: Query Processing Time Utilization with Round-Robin +config: + gantt: + displayMode: compact + numberSectionStyles: 2 + theme: default +--- +gantt + dateFormat ss + axisFormat %S + tickInterval 1second + + section consumer-1 + fast :active, c1-1, 00, 1s + fast :active, c1-2, 01, 1s + fast :active, c1-3, 02, 1s + slow :done, c1-4, 03, 8s + slow... :done, c1-5, 11, 5s + + section consumer-2 + slow :done, c2-1, 00, 8s + fast :active, c2-2, 08, 1s + fast :active, c2-3, 09, 1s + fast :active, c2-4, 10, 1s + fast :active, c2-5, 11, 1s + slow... :done, c2-6, 12, 4s + + section consumer-3 + fast :active, c3-1, 00, 1s + slow :done, c3-2, 01, 8s + slow... :done, c3-3, 09, 7s + + section consumer-4 + slow :done, c4-1, 00, 8s + slow :done, c4-2, 08, 8s + +``` + + + + + +### Solution: Query Component Partitioning by Querier-Worker + +This solution is inspired by +[Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services](https://people.mpi-sws.org/~jcmace/papers/mace20162dfq.pdf). + +Querier-worker connections are given IDs, +and partitioned evenly across up to four possible query-component nodes +via a modulo operation: `querier-worker connection ID % number of query-component nodes`. + +Ex: +Assume a query component node order of `[ingester, store-gateway, ingester-and-store-gateway, unknown]`. + +- querier-worker connection IDs `0`, `4`, `8`, etc. would be assigned to `ingester` +- querier-worker connection IDs `1`, `5`, `9`, etc. would be assigned to `store-gateway` +- etc. 
for `ingester-and-store-gateway`, and `unknown`

We conservatively expect that degradation of the store-gateway query component will cause high latency
for the queries in the `store-gateway`, `ingester-and-store-gateway`, and `unknown` queues.
By partitioning the querier-worker connections evenly across the four queues,
25% of connections remain "reserved" to process queries from the `ingester` queue.

The primary measure of success is the servicing of the queries to the non-degraded query component.
In real-world scenarios the slow queries are often slow enough to hit timeouts,
and the majority of those queries are expected to fail until the component recovers.

A secondary measure of success is the continued utilization of queriers while any requests remain in the queue.
The modulo operation described above supports this: if, in the example above,
we exhaust the `ingester` queue, its node is removed and its querier-worker connections are distributed
amongst the remaining three queues as they become available again.

#### Modeling the Solution

Again we simplify the system to two query components and four querier connections.
Queries to the "slow" query component take 8 ticks to process while queries to the "fast" query component take 1 tick.

In 64 ticks (16 each for 4 connections), the new system:

- dequeues and starts processing 36 queries: 32 fast, 4 slow
- completes processing 36 queries: 32 fast, 4 slow
- spends 32 ticks processing the fast queries
- spends 32 ticks processing the slow queries

Compare with the `Query Processing Time Utilization with Round-Robin` results and diagram above.
The new system allocates more query processing time to the fast queries
and completes 8x more fast queries than the original system in the same time period.
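The partitioned model's arithmetic can be checked directly, since dedicated consumers leave nothing to interleave. This sketch uses the same assumptions as the model above: a 16-tick horizon per connection, fast queries costing 1 tick and slow queries costing 8.

```go
package main

import "fmt"

// partitionedCounts computes throughput when the four connections are
// split evenly between the two components: two consumers serve only
// fast queries and two serve only slow queries.
func partitionedCounts(horizon int) (fastDone, slowDone, fastBusy, slowBusy int) {
	const fastCost, slowCost = 1, 8
	fastDone = 2 * (horizon / fastCost) // two dedicated fast consumers
	slowDone = 2 * (horizon / slowCost) // two dedicated slow consumers
	return fastDone, slowDone, fastDone * fastCost, slowDone * slowCost
}

func main() {
	fastDone, slowDone, fastBusy, slowBusy := partitionedCounts(16)
	fmt.Println(fastDone, "fast and", slowDone, "slow completed")    // 32 fast and 4 slow completed
	fmt.Println("busy ticks:", fastBusy, "fast,", slowBusy, "slow") // busy ticks: 32 fast, 32 slow
}
```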
+ +##### Diagram: Query Processing Time Utilization with Querier-Worker Queue Prioritization + +```mermaid +--- +title: Query Processing Time Utilization with Querier-Worker Queue Prioritization +config: + gantt: + displayMode: compact + numberSectionStyles: 2 + theme: default +--- +gantt + dateFormat ss + axisFormat %S + tickInterval 1second + + section consumer-1 + fast :active, c1-1, 00, 1s + fast :active, c1-2, 01, 1s + fast :active, c1-3, 02, 1s + fast :active, c1-4, 03, 1s + fast :active, c1-5, 04, 1s + fast :active, c1-6, 05, 1s + fast :active, c1-7, 06, 1s + fast :active, c1-8, 07, 1s + fast :active, c1-9, 08, 1s + fast :active, c1-10, 09, 1s + fast :active, c1-11, 10, 1s + fast :active, c1-12, 11, 1s + fast :active, c1-13, 12, 1s + fast :active, c1-14, 13, 1s + fast :active, c1-15, 14, 1s + fast :active, c1-16, 15, 1s + + section consumer-2 + fast :active, c2-1, 00, 1s + fast :active, c2-2, 01, 1s + fast :active, c2-3, 02, 1s + fast :active, c2-4, 03, 1s + fast :active, c2-5, 04, 1s + fast :active, c2-6, 05, 1s + fast :active, c2-7, 06, 1s + fast :active, c2-8, 07, 1s + fast :active, c2-9, 08, 1s + fast :active, c2-10, 09, 1s + fast :active, c2-11, 10, 1s + fast :active, c2-12, 11, 1s + fast :active, c2-13, 12, 1s + fast :active, c2-14, 13, 1s + fast :active, c2-15, 14, 1s + fast :active, c2-16, 15, 1s + + + section consumer-3 + slow :done, c3-1, 00, 8s + slow :done, c3-2, 08, 8s + + section consumer-4 + slow :done, c4-1, 00, 8s + slow :done, c4-2, 08, 8s + +``` + +#### Caveats: Corner Cases and Things to Know + +##### Distribution of Querier-Worker Connections Across Query Component Nodes +**If there are fewer than 4 querier-worker connections to the request queue, a query-component +node can be starved of connections.** +To prevent this, the querier has been updated to create at least 4 connections to each scheduler, +ignoring any `-querier.max-concurrent` value below 4. 
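The modulo partitioning, and the way node deletion naturally remaps the affected connections, can be sketched as follows (function and variable names are illustrative, not the real implementation):

```go
package main

import "fmt"

// assignNode maps a querier-worker connection to a query-component node
// using the modulo rule described above: connection ID % number of nodes.
// nodeOrder is the current order of query-component nodes in the tree.
func assignNode(workerID int, nodeOrder []string) string {
	return nodeOrder[workerID%len(nodeOrder)]
}

func main() {
	nodes := []string{"ingester", "store-gateway", "ingester-and-store-gateway", "unknown"}
	for workerID := 0; workerID < 5; workerID++ {
		fmt.Printf("worker %d -> %s\n", workerID, assignNode(workerID, nodes))
	}
	// If the ingester queue empties and its node is deleted, the same
	// worker IDs redistribute across the remaining three nodes:
	remaining := nodes[1:]
	fmt.Println("worker 0 ->", assignNode(0, remaining)) // worker 0 -> store-gateway
}
```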
**When the total number of querier-worker connections is not evenly divisible by the number of query component nodes,
the modulo distribution will be uneven, with some nodes being assigned one extra connection.**
This is not an issue.
Queue nodes are deleted as queues are cleared, then recreated in whichever order new queries arrive.
As the node count and order change over time, the node(s) which receive the extra connections are naturally shuffled.

##### Empty Queue Node Deletion Can Cause Temporary Starvation

As mentioned above, when a queue node is emptied, it is deleted from the tree structure
and cannot be selected by the queue selection algorithms.
This can result in the following scenario:

1. Queries to store-gateways are experiencing high latency, causing backup
   in the `store-gateway`, `ingester-and-store-gateway`, and `unknown` queues.
2. The ingester-only queries continue to be dequeued and processed by 1/4 of the querier-worker connections.
3. The ingester-only queue is exhausted and the `ingester` node is deleted from the tree.
4. The querier-worker connections are now evenly distributed across the remaining three nodes,
   and _all_ connections are now stuck working on slow queries touching the degraded store-gateways.
5. More ingester-only queries arrive and are enqueued at the `ingester` node,
   but no querier-worker connections are available to dequeue them.

This scenario is not desirable, but it is considered an acceptable tradeoff against the alternatives.
As soon as the connections which would be partitioned to the `ingester` node become available again,
they will return to working on the ingester-only queries.

Modeling a simplified system again shows that this scenario still improves on the previous state.

If the fast-query queue is cleared and deleted before tick 4 and created again after tick 5,
the fast-query queue consumers will have dequeued slow queries at tick 4 and will work on them until tick 12.
+At tick 12 they return to being dedicated solely to the fast-query queue. + +Despite the temporary fast queue starvation, in 64 ticks (16 each for 4 connections), the new system +still dequeues, starts, and completes processing 22 queries (16 fast, 6 slow) - +still 2x more fast queries than the original system in the same time period. + +##### Diagram: Temporary Fast Queue Starvation with Querier-Worker Queue Prioritization + +```mermaid +--- +title: Temporary Fast Queue Starvation with Querier-Worker Queue Prioritization +config: + gantt: + displayMode: compact + numberSectionStyles: 2 + theme: default +--- +gantt + dateFormat ss + axisFormat %S + tickInterval 1second + + section consumer-1 + fast :active, c1-1, 00, 1s + fast :active, c1-2, 01, 1s + fast :active, c1-3, 02, 1s + fast :active, c1-4, 03, 1s + slow :done, c1-5, 04, 8s + fast :active, c1-6, 12, 1s + fast :active, c1-7, 13, 1s + fast :active, c1-8, 14, 1s + fast :active, c1-9, 15, 1s + + section consumer-2 + fast :active, c2-1, 00, 1s + fast :active, c2-2, 01, 1s + fast :active, c2-3, 02, 1s + fast :active, c2-4, 03, 1s + slow :done, c2-5, 04, 8s + fast :active, c2-6, 12, 1s + fast :active, c2-7, 13, 1s + fast :active, c2-8, 14, 1s + fast :active, c2-9, 15, 1s + + section consumer-3 + slow :done, c3-1, 00, 8s + slow :done, c3-2, 08, 8s + + section consumer-4 + slow :done, c4-1, 00, 8s + slow :done, c4-2, 08, 8s + +``` + +### Benchmarks and Simulation + +The new solution was tested using both Go benchmarks and simulation in a live Mimir cluster. + +The measure of comparison is the time spent in queue by the queries for the non-degraded query component. +All test scenarios slowed down the store-gateway queries and measured the time spent in queue by ingester-only queries. 
#### Simulation in a Live Cluster with Load Generation

A 60-minute simulation was run with the following parameters:

- 10k queries per second, split 50 / 50 between ingesters and store-gateways
- 10-second artificial slowdown in the store-gateways created with a sleep in the `Series` method
- start at 19:05
- complete at 20:05
- scheduler restarted to enable the new queue selection algorithm at the midpoint, 19:35

Before 19:35, the queue latency for ingester-only queries approached that of the store-gateway queries.

As the queue backed up, queries for all query components had:

- 99th percentile time in queue near 2 minutes
- 50th percentile time in queue over 1 minute

After the new solution was enabled at 19:35, ingester-only queries were able to be dequeued and processed
independently of the store-gateway queries, and queue latency for the ingester-only queries dropped significantly.

While queue latency for store-gateway queries remained steadily high, ingester-only queries had:

- 99th percentile time in queue around 2 seconds
- 50th percentile time in queue around 200 ms

The Mimir / Reads dashboard sections for the Query Scheduler illustrate the impact of the new solution:

![image Live Simulation](./images/request-queue-design-live-simulation.png) diff --git a/pkg/scheduler/images/request-queue-design-live-simulation.png b/pkg/scheduler/images/request-queue-design-live-simulation.png new file mode 100644 index 00000000000..54d29a532a0 Binary files /dev/null and b/pkg/scheduler/images/request-queue-design-live-simulation.png differ