
[SPARK-12757] Add block-level read/write locks to BlockManager #10705

Closed
JoshRosen wants to merge 93 commits into apache:master from JoshRosen:pin-pages

Conversation

JoshRosen
Contributor

Motivation

As a pre-requisite to off-heap caching of blocks, we need a mechanism to prevent pages / blocks from being evicted while they are being read. With on-heap objects, evicting a block while it is being read merely leads to memory-accounting problems (because we assume that an evicted block is a candidate for garbage-collection, which will not be true during a read), but with off-heap memory this will lead to either data corruption or segmentation faults.

Changes

BlockInfoManager and reader/writer locks

This patch adds block-level read/write locks to the BlockManager. It introduces a new BlockInfoManager component, contained within the BlockManager, which holds the BlockInfo objects that the BlockManager uses to track block metadata and which exposes APIs for locking blocks in either shared read or exclusive write mode.

BlockManager's get*() and put*() methods now implicitly acquire the necessary locks. After a get() call successfully retrieves a block, that block is locked in shared read mode. A put() call will block until it acquires an exclusive write lock. If the write succeeds, the write lock is downgraded to a shared read lock before returning to the caller. This put() locking behavior allows us to store a block and then immediately turn around and read it without having to worry about it having been evicted between the write and the read, which will allow us to significantly simplify CacheManager in the future (see #10748).

See BlockInfoManagerSuite's test cases for a more detailed specification of the locking semantics.
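
To make these semantics concrete, here is a minimal, self-contained sketch (in the spirit of this patch, but not its actual code) of the shared-read / exclusive-write / downgrade behavior that a single block's metadata needs to support; the method names are chosen for illustration only:

class SimpleBlockLock {
  private val NoWriter = -1L
  private var readerCount = 0              // number of outstanding shared read locks
  private var writerTask: Long = NoWriter  // task attempt id holding the exclusive write lock

  // Block until no writer holds the lock, then record one more reader.
  def lockForReading(): Unit = synchronized {
    while (writerTask != NoWriter) { wait() }
    readerCount += 1
  }

  // Block until there are no readers and no writer, then record the writer.
  def lockForWriting(taskAttemptId: Long): Unit = synchronized {
    while (readerCount != 0 || writerTask != NoWriter) { wait() }
    writerTask = taskAttemptId
  }

  // Atomically convert an exclusive write lock into a shared read lock,
  // which is what put() does before returning to its caller.
  def downgrade(taskAttemptId: Long): Unit = synchronized {
    require(writerTask == taskAttemptId, "caller does not hold the write lock")
    writerTask = NoWriter
    readerCount += 1
    notifyAll()
  }

  // Release one shared read lock.
  def unlockRead(): Unit = synchronized {
    readerCount -= 1
    notifyAll()
  }
}

A real implementation also has to remember which task attempt holds which lock so that locks can be released in bulk when a task ends, which is the subject of the next section.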

Auto-release of locks at the end of tasks

Our locking APIs support explicit release of locks (by calling unlock()), but it's not always possible to guarantee that locks will be released prior to the end of the task. One reason for this is our iterator interface: since our iterators don't support an explicit close() operator to signal that no more records will be consumed, operations like take() or limit() don't have a good means to release locks on their input iterators' blocks. Another example is broadcast variables, whose block locks can only be released at the end of the task.

To address this, BlockInfoManager uses a pair of maps to track the set of locks acquired by each task. Lock acquisitions automatically record the current task attempt id by obtaining it from TaskContext. When a task finishes, code in Executor calls BlockInfoManager.unlockAllLocksForTask(taskAttemptId) to free locks.
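
The per-task bookkeeping can be pictured roughly as follows (illustrative only; the name TaskLockRegistry is made up, and the real code tracks write locks as well as read locks):

import scala.collection.mutable

class TaskLockRegistry {
  // task attempt id -> ids of blocks whose read locks the task currently holds
  private val readLocksByTask = mutable.HashMap[Long, mutable.ArrayBuffer[String]]()

  // Record each acquisition against the acquiring task attempt id.
  def recordReadLock(taskAttemptId: Long, blockId: String): Unit = synchronized {
    readLocksByTask.getOrElseUpdate(taskAttemptId, mutable.ArrayBuffer()) += blockId
  }

  // Called when the task finishes; returns the blocks whose locks were still
  // held so that the caller can release them.
  def unlockAllLocksForTask(taskAttemptId: Long): Seq[String] = synchronized {
    readLocksByTask.remove(taskAttemptId).map(_.toSeq).getOrElse(Seq.empty)
  }
}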

Locking and the MemoryStore

In order to prevent in-memory blocks from being evicted while they are being read, the MemoryStore's evictBlocksToFreeSpace() method acquires write locks on blocks which it is considering as candidates for eviction. These lock acquisitions are non-blocking, so a block which is being read will not be evicted. By holding write locks until the eviction is performed or skipped (in case evicting the blocks would not free enough memory), we avoid a race where a new reader starts to read a block after the block has been marked as an eviction candidate but before it has been removed.
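
A rough sketch of that eviction pass, with the lock operations abstracted into function parameters since this is not the real MemoryStore code:

def selectEvictionCandidates(
    candidates: Seq[String],
    bytesNeeded: Long,
    sizeOf: String => Long,
    tryLockForWriting: String => Boolean,  // non-blocking; false if the block is locked for reading
    unlockWrite: String => Unit): Seq[String] = {
  val selected = scala.collection.mutable.ArrayBuffer[String]()
  var freedBytes = 0L
  val it = candidates.iterator
  while (freedBytes < bytesNeeded && it.hasNext) {
    val blockId = it.next()
    // A failed try-lock means some task holds the block; skip it rather than wait.
    if (tryLockForWriting(blockId)) {
      selected += blockId
      freedBytes += sizeOf(blockId)
    }
  }
  if (freedBytes >= bytesNeeded) {
    selected.toSeq  // the caller evicts these while still holding their write locks
  } else {
    // Evicting everything we could lock would not free enough memory: release and give up.
    selected.foreach(unlockWrite)
    Seq.empty
  }
}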

Locking and remote block transfer

This patch makes small changes to block transfer and network-layer code so that locks acquired by the BlockTransferService are released as soon as block transfer messages are consumed and released by Netty. This builds on top of #11193, a bug fix related to freeing of network-layer ManagedBuffers.
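
The shape of that change can be sketched generically (assumed names; the real patch plumbs this through ManagedBuffer and the Netty transport code):

// Simplified stand-in for a reference-counted network buffer.
trait RefCountedBuffer {
  def release(): Unit
}

// Wraps the buffer served for a block transfer so that releasing the buffer
// also releases the read lock that was acquired to serve the transfer.
class LockReleasingBuffer(underlying: RefCountedBuffer, releaseBlockLock: () => Unit)
  extends RefCountedBuffer {
  override def release(): Unit = {
    try underlying.release()
    finally releaseBlockLock()
  }
}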

FAQ

  • Why not use Java's built-in ReadWriteLock?

    Our locks operate on a per-task rather than per-thread level. Under certain circumstances a task may consist of multiple threads, so using ReadWriteLock would mean that we might call unlock() from a thread which didn't hold the lock in question, an operation which has undefined semantics. If we could rely on Java 8 classes, we might be able to use StampedLock to work around this issue.

  • Why not detect "leaked" locks in tests?

    See the notes above about take() and limit().

@SparkQA

SparkQA commented Jan 11, 2016

Test build #49167 has finished for PR 10705 at commit 7265784.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 14, 2016

Test build #49411 has finished for PR 10705 at commit 7cad770.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -43,6 +43,7 @@ import org.apache.spark.rpc.RpcEnv
import org.apache.spark.serializer.{Serializer, SerializerInstance}
import org.apache.spark.shuffle.ShuffleManager
import org.apache.spark.util._
import org.apache.spark.util.collection.ReferenceCounter
Contributor

Can you add a comment in this class that explains the ref counting mechanism? It can be a shorter version of the commit message.
Specifically:
What are the invariants? (explain get()) Need to call release. What does it mean if it is 0?

I slightly prefer pin count over ref count (the block manager has a reference but it is unpinned)

/**
* Thread-safe collection for maintaining both global and per-task reference counts for objects.
*/
private[spark] class ReferenceCounter[T] {
Contributor

Is there any reason you did it this way instead of a counter per object? Not sure how many blocks we have but this seems contention prone.

Contributor Author

I need to maintain a global count for each object as well as counts for each task, in order to automatically decrement the global counts when tasks finish (I'm working on adding the releaseAllReferencesForTask() call to the task-completion cleanup code).

If I stored the global count per block inside of the BlockInfo class, then I'd still need a mechanism to count the references per task. If the counts for each task were stored in BlockInfo then I'd have to loop over the BlockInfo list on task completion in order to clear those counts, or would have to maintain the counts separately. As a result, it made sense to me to keep both types of counts in close proximity like this.
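
A stripped-down sketch of that two-map layout (illustrative only, not the patch's ReferenceCounter):

import scala.collection.mutable

class SimpleReferenceCounter[T] {
  // Global count per object, plus per-task counts so that a finishing task's
  // references can be dropped in bulk.
  private val globalCounts: mutable.Map[T, Int] =
    mutable.HashMap.empty[T, Int].withDefaultValue(0)
  private val countsByTask = mutable.HashMap[Long, mutable.Map[T, Int]]()

  def retain(taskAttemptId: Long, obj: T): Unit = synchronized {
    globalCounts(obj) += 1
    val taskCounts = countsByTask.getOrElseUpdate(
      taskAttemptId, mutable.HashMap.empty[T, Int].withDefaultValue(0))
    taskCounts(obj) += 1
  }

  // Decrement the global counts for everything the task still references.
  def releaseAllReferencesForTask(taskAttemptId: Long): Unit = synchronized {
    countsByTask.remove(taskAttemptId).foreach { taskCounts =>
      taskCounts.foreach { case (obj, n) => globalCounts(obj) -= n }
    }
  }

  def referenceCountFor(obj: T): Int = synchronized { globalCounts(obj) }
}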

@SparkQA

SparkQA commented Jan 14, 2016

Test build #49422 has finished for PR 10705 at commit c1a8d85.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49419 has finished for PR 10705 at commit 8ae88b0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49427 has finished for PR 10705 at commit 575a47b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49477 has finished for PR 10705 at commit 0ba8318.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2016

Test build #49773 has finished for PR 10705 at commit 90cf403.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 21, 2016

Test build #49861 has finished for PR 10705 at commit 12ed084.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Contributor

nongli commented Feb 24, 2016

LGTM. Good work!

@SparkQA

SparkQA commented Feb 24, 2016

Test build #51821 has finished for PR 10705 at commit b9d6e18.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 24, 2016

Test build #51841 has finished for PR 10705 at commit b963178.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@JoshRosen
Contributor Author

Jenkins retest this please

@SparkQA

SparkQA commented Feb 24, 2016

Test build #51883 has finished for PR 10705 at commit 9becde3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Feb 25, 2016

Test build #51910 has finished for PR 10705 at commit 9becde3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins retest this please

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Feb 25, 2016

Test build #51985 has finished for PR 10705 at commit 9becde3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

memoryStore.getValues(blockId).map(new BlockResult(_, DataReadMethod.Memory, info.size))
memoryStore.getValues(blockId).map { iter =>
val ci = CompletionIterator[Any, Iterator[Any]](iter, releaseLock(blockId))
new BlockResult(ci, DataReadMethod.Memory, info.size)
Contributor

Right now there's still a chance that the programmer forgets to wrap the iter. I would actually push the CompletionIterator logic one step further into BlockResult itself, e.g.

private[spark] class BlockResult(
    iter: Iterator[Any],
    blockId: BlockId,
    val readMethod: DataReadMethod.Value,
    val bytes: Long) {

  /**
   * Values of this block, to be consumed at most once.
   *
   * If this block was read locally, then we must have acquired a read lock on this block.
   * If so, release the lock once this iterator is drained. In cases where we don't consume
   * the entire iterator (e.g. take or limit), we rely on the executor releasing all locks
   * held by this task attempt at the end of the task.
   *
   * Otherwise, if this block was read remotely from other executors, there is no need to
   * do this because we didn't acquire any locks on the block.
   */
  val data: Iterator[Any] = {
    if (readMethod != DataReadMethod.Network) {
      CompletionIterator[Any, Iterator[Any]](iter, releaseLock(blockId))
    } else {
      iter
    }
  }
}

I did a search and could not find another place where we would not want to release the lock, other than getRemoteBlock.

Contributor Author

If you push it further then BlockResult needs to hold a reference to the BlockManager.

Contributor Author

I'll do this in a followup.

Contributor

can we pass in an optional completion callback instead?

Contributor Author

We still need to handle the DataReadMethod == Network case somewhere since there's no lock to release in that case, so having an optional callback in the constructor seems like it faces the same problem of someone forgetting to add it.

Contributor

The difference is that now the programmer needs to explicitly pass completionCallback = None. If the completionCallback is specified then you don't need to do the network check. It's better in that today you have zero reminder that you need to release the lock by the end of the task.

Actually an even better way IMO is to have a LocalBlockResult and a RemoteBlockResult so there's no way the programmer can forget to release the lock.

By the way, I'm not quite done reviewing yet but feel free to address these in a follow-up patch.
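
A sketch of that LocalBlockResult / RemoteBlockResult suggestion (hypothetical types, not part of this patch), where a local read must supply a release callback and a remote read cannot:

sealed trait BlockResultSketch {
  def data: Iterator[Any]
}

// Locally-read block: the compiler forces the caller to supply the callback
// that releases the read lock once the iterator has been drained.
final class LocalBlockResult(iter: Iterator[Any], releaseLock: () => Unit) extends BlockResultSketch {
  private var released = false
  val data: Iterator[Any] = new Iterator[Any] {
    def hasNext: Boolean = {
      val more = iter.hasNext
      if (!more && !released) { released = true; releaseLock() }
      more
    }
    def next(): Any = iter.next()
  }
}

// Remotely-read block: no lock was acquired, so there is nothing to release.
final class RemoteBlockResult(iter: Iterator[Any]) extends BlockResultSketch {
  val data: Iterator[Any] = iter
}

As with the merged patch, an iterator that is never fully drained would still rely on the end-of-task cleanup to release its lock.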

@SparkQA

SparkQA commented Feb 25, 2016

Test build #51987 has finished for PR 10705 at commit 9becde3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// A block is either locked for reading or for writing, but not for both at the same time:
assert(_readerCount == 0 || _writerTask == BlockInfo.NO_WRITER)
// If a block is removed then it is not locked:
assert(!_removed || (_readerCount == 0 && _writerTask == BlockInfo.NO_WRITER))
Contributor

nit: clearer

if (_removed) {
  assert(_readerCount == 0 ...)
}

@andrewor14
Contributor

LGTM. There are still a few remaining issues about maintainability but they can be addressed in a follow-up patch.

@andrewor14
Contributor

Merged into master.

@asfgit asfgit closed this in 633d63a Feb 26, 2016
synchronized {
get(blockId).foreach { info =>
info.readerCount -= lockCount
assert(info.readerCount >= 0)
Contributor

Should an exception be thrown here instead? In production, assertions may not be enabled.
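
One way to act on this comment (illustrative only; the merged patch keeps the assertion) would be to throw explicitly so the check also runs when JVM assertions are disabled:

if (info.readerCount < 0) {
  throw new IllegalStateException(
    s"Block $blockId has a negative reader count after unlock: ${info.readerCount}")
}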

asfgit pushed a commit that referenced this pull request Mar 2, 2016
CacheManager directly calls MemoryStore.unrollSafely() and has its own logic for handling graceful fallback to disk when cached data does not fit in memory. However, this logic also exists inside of the MemoryStore itself, so this appears to be unnecessary duplication.

Thanks to the addition of block-level read/write locks in #10705, we can refactor the code to remove the CacheManager and replace it with an atomic `BlockManager.getOrElseUpdate()` method.

This pull request replaces / subsumes #10748.

/cc andrewor14 and nongli for review. Note that this changes the locking semantics of a couple of internal BlockManager methods (`doPut()` and `lockNewBlockForWriting`), so please pay attention to the Scaladoc changes and new test cases for those methods.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11436 from JoshRosen/remove-cachemanager.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
@JoshRosen JoshRosen deleted the pin-pages branch August 29, 2016 19:20