swarm/storage/localstore: add description for gc size counting

ethersphere · Feb 5, 2019 · 40432d9 · 40432d9
1 parent f056e86
commit 40432d9
Showing 1 changed file with 73 additions and 0 deletions.
diff --git a/swarm/storage/localstore/gc.go b/swarm/storage/localstore/gc.go
@@ -14,6 +14,79 @@
 // You should have received a copy of the GNU Lesser General Public License
 // along with the go-ethereum library. If not, see <http://www.gnu.org/licenses/>.
 
+/*
+Counting number of items in garbage collection index
+
+The number of items in garbage collection index is not the same as the number of
+chunks in retrieval index (total number of stored chunks). Chunk can be garbage
+collected only when it is set to a synced state by ModSetSync, and only then can
+be counted into garbage collection size, which determines whether a number of
+chunk should be removed from the storage by the garbage collection. This opens a
+possibility that the storage size exceeds the limit if files are locally
+uploaded and the node is not connected to other nodes or there is a problem with
+syncing.
+
+Tracking of garbage collection size (gcSize) is focused on performance. Key
+points:
+
+ 1. counting the number of key/value pairs in LevelDB takes around 0.7s for 1e6
+    on a very fast ssd (unacceptable long time in reality)
+ 2. locking leveldb batch writes with a global mutex (serial batch writes) is
+    not acceptable, we should use locking per chunk address
+
+Because of point 1. we cannot count the number of items in garbage collection
+index in New constructor as it could last very long for realistic scenarios
+where limit is 5e6 and nodes are running on slower hdd disks or cloud providers
+with low IOPS.
+
+Point 2. is a performance optimization to allow parallel batch writes with
+getters, putters and setters. Every single batch that they create contain only
+information related to a single chunk, no relations with other chunks or shared
+statistical data (like gcSize). This approach avoids race conditions on writing
+batches in parallel, but creates a problem of synchronizing statistical data
+values like gcSize. With global mutex lock, any data could be written by any
+batch, but would not use utilize the full potential of leveldb parallel writes.
+
+To mitigate this two problems, the implementation of counting and persisting
+gcSize is split into two parts. One is the in-memory value (gcSize) that is fast
+to read and write with a dedicated mutex (gcSizeMu) if the batch which adds or
+removes items from garbage collection index is successful. The second part is
+the reliable persistence of this value to leveldb database, as storedGCSize
+field. This database field is saved by writeGCSizeWorker and writeGCSize
+functions when in-memory gcSize variable is changed, but no too often to avoid
+very frequent database writes. This database writes are triggered by
+writeGCSizeTrigger when a call is made to function incGCSize. Trigger ensures
+that no database writes are done only when gcSize is changed (contrary to a
+simpler periodic writes or checks). A backoff of 10s in writeGCSizeWorker
+ensures that no frequent batch writes are made. Saving the storedGCSize on
+database Close function ensures that in-memory gcSize is persisted when database
+is closed.
+
+This persistence must be resilient to failures like panics. For this purpose, a
+collection of hashes that are added to the garbage collection index, but still
+not persisted to storedGCSize, must be tracked to count them in when DB is
+constructed again with New function after the failure (swarm node restarts). On
+every batch write that adds a new item to garbage collection index, the same
+hash is added to gcUncountedHashesIndex. This ensures that there is a persisted
+information which hashes were added to the garbage collection index. But, when
+the storedGCSize is saved by writeGCSize function, this values are removed in
+the same batch in which storedGCSize is changed to ensure consistency. When the
+panic happen, or database Close method is not saved. The database storage
+contains all information to reliably and efficiently get the correct number of
+items in garbage collection index. This is performed in the New function when
+all hashes in gcUncountedHashesIndex are counted, added to the storedGCSize and
+saved to the disk before the database is constructed again. Index
+gcUncountedHashesIndex is acting as dirty bit for recovery that provides
+information what needs to be corrected. With a simple dirty bit, the whole
+garbage collection index should me counted on recovery instead only the items in
+gcUncountedHashesIndex. Because of the triggering mechanizm of writeGCSizeWorker
+and relatively short backoff time, the number of hashes in
+gcUncountedHashesIndex should be low and it should take a very short time to
+recover from the previous failure. If there was no failure and
+gcUncountedHashesIndex is empty, which is the usual case, New function will take
+the minimal time to return.
+*/
+
 package localstore
 
 import (