Prevent SqlConnection from being held past an async boundary #22658
Conversation
I think the pool must still be upper bounded. There is no reason for it to be unbounded and cache 100 connections when most of the time 80 of them are just sitting there doing nothing.
Oops, button clicked by mistake, sorry.
By the way, how many connections were there? I'm surprised that even though connections were held across async boundaries, it led to that many concurrent connections being needed.
The pool has a soft upper bound of the number of managed threads. I was actually originally planning to add a second commit to add a hard limit on the pool itself, but realized there's no longer a code path to do anything interesting there. The specific limit is the number of threads attempting to access the database concurrently. Given the threading mode, this is a good limit. We could further reduce resources over time by using an MRU and cache eviction, but I don't believe it will get us any further than we made it so far. Plus there seem to be other areas of the SQLite cache that could be improved first.
The issue was found in a heap that had 2740 connections. The screenshot for the issue was from another heap with 3429. I've seen OOM heap dumps with ~900 all the way up to a winner of ~7000.
I still don't see why an upper limit is bad. If you really don't want one, put a crash there, so that if the limit goes higher than we expect, we explicitly crash VS.
Hmm... I still can't understand how there can be 7000 pooled connections. Are we having a thread starvation issue again?
@heejaechang There is no need to crash VS since it typically manifests as UI delays, etc. The telemetry team has new red flags in place and will alert us if they see low-memory conditions with a large number of these instances. Also, the use of the shared cache flag (which we should talk about separately) should prevent the cache from growing even if the number of connections does.
We don't allow reads while writes are pending. Previously the read operations would claim a connection before waiting (asynchronously) for pending operations to complete. Now read operations return the connection to the pool before waiting.
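The "return the connection before waiting" idea can be sketched outside of Roslyn. Below is a minimal Python/asyncio illustration (the `Pool` class and all names are hypothetical, not Roslyn's actual code): when readers wait for pending writes *before* acquiring a connection, the pool's growth tracks actual concurrent use rather than the number of waiters.

```python
import asyncio

class Pool:
    """Toy connection pool: tracks how many 'connections' were ever created."""
    def __init__(self):
        self.free = []      # idle connections, ready for reuse
        self.created = 0    # total connections ever opened

    def acquire(self):
        if self.free:
            return self.free.pop()
        self.created += 1
        return object()     # stand-in for a real database connection

    def release(self, conn):
        self.free.append(conn)

async def read(pool, writes_done):
    # Old behavior: acquire() here, THEN await -> every waiting reader
    # holds a connection across the async boundary.
    # New behavior: await first, acquire only for the synchronous read.
    await writes_done.wait()    # no connection held while waiting
    conn = pool.acquire()
    try:
        pass                    # the actual (synchronous) read would go here
    finally:
        pool.release(conn)      # returned to the pool immediately

async def main():
    pool = Pool()
    writes_done = asyncio.Event()
    readers = [asyncio.create_task(read(pool, writes_done)) for _ in range(100)]
    await asyncio.sleep(0)      # let every reader start waiting
    writes_done.set()           # "pending writes" complete
    await asyncio.gather(*readers)
    return pool.created

created = asyncio.run(main())
print(created)  # far fewer than the 100 connections the old pattern would pin
```

With the old ordering, all 100 readers would hold a connection for the full duration of the wait, so the pool would grow to 100.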
In other words, when FAR starts a per-document search, if the SyntaxTreeIndex is being written, then there will be a connection per document?
I don't understand. If we are sure connections won't be pooled to 7000, then we should put that assumption in the code; if we are not sure, we should guard against it. Otherwise, we are just hiding the problem.
```diff
@@ -53,7 +53,7 @@ public static SqlConnection Create(IPersistentStorageFaultInjector faultInjector

     // Enable shared cache so that multiple connections inside of same process share cache
     // see https://sqlite.org/threadsafe.html for more detail
-    var flags = OpenFlags.SQLITE_OPEN_CREATE | OpenFlags.SQLITE_OPEN_READWRITE | OpenFlags.SQLITE_OPEN_SHAREDCACHE;
+    var flags = OpenFlags.SQLITE_OPEN_CREATE | OpenFlags.SQLITE_OPEN_READWRITE | OpenFlags.SQLITE_OPEN_NOMUTEX | OpenFlags.SQLITE_OPEN_SHAREDCACHE;
```
Why this flag?
This flag says we are only using connection instances on one thread at a time. We were already doing this, but since we didn't specify the flag the database fell back to SQLITE_OPEN_FULLMUTEX which added a second layer of locks.
With the current implementation, if we hit 100 at one point and never hit more than 20 again, the pool still holds onto 100 connections, wasting 80. Eviction and so on are improvements to the pool; we can remove the upper limit when we implement them. There is no reason to skip the limit now just because we might do eviction in the future, especially when putting in an upper limit is so simple. ... Also, I still can't understand how there are 7000 pooled connections. Are we sure we are not leaking something?
I don't know enough about the reasons why we read from the database to answer this directly. However, I do know two things:
There are some edge cases which make it very difficult to predict the hard upper bound. I'm willing to implement telemetry and perhaps a time-based pool eviction policy so we are aware of regressions, but to me it's not worth causing a user to crash (coded assertions) or severe performance degradation (SQL connections getting created due to exceeding pool limits).
Are you sure this causes severe performance degradation? Currently we have evidence that pooling causes severe degradation (OOM), but not the other way around. Especially with the change you made here, I think an upper limit won't affect perf as much.
By the way, any pool or cache that doesn't have an upper limit is, I think, a code smell.
I agree. But one step at a time is good, especially on a rushed change like this.
As for the original case: when developing #22340 I had file logging enabled, and I saw a large chunk of connections running amok (I had a log file for each pid+thread_id combo on connection open, so I noticed lots of files quickly appearing), but it appears that having a fast CPU/disk prevents process death. I didn't report it in #22340 because I believed it was the "normal" mode of operation, and I saw it even without the proposed locking changes. As for the connection pool limit, I'd put a cap on it. Crashing is of course not an option, but logging (not sure what policy Roslyn has for these kinds of errors) plus a wait queue or reverting to a non-pooled approach looks fine to me (yes, it will slow things down, and it would be another case to solve, but it's still an improvement over crashing).
Given that we use more than one connection object, it appears
There is a soft cap in place equivalent to the maximum number of active threads concurrently accessing the database. Given that the threads are generally within the .NET thread pool, this is loosely related to the number of logical processors on the machine, but can vary a bit from that. I would not expect this to ever exceed the number of logical processors times two, and plan to verify this through telemetry.
Shared cache is beneficial to us since it keeps each connection from being so expensive.
Shared cache also causes concurrent write attempts on separate connection objects to block on each other. I'm hoping this is rare, but we should be watching to see if performance reports deviate from expectation. For now it is acting as a secondary safeguard against runaway memory problems from the connection pool growing (and for the 15.5 Preview 1 release, it will be the only remediation in place).
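The effect of shared cache can be observed with Python's built-in `sqlite3` module (this is an illustration of the SQLite feature, not the Roslyn code; the database name `demo` is arbitrary): two connections opened with `cache=shared` share one cache and one set of tables instead of each holding a private page cache.

```python
import sqlite3

# Two connections to one named in-memory database: with cache=shared in
# the URI, both connections share a single cache (and see the same data)
# instead of each holding its own private page cache.
a = sqlite3.connect("file:demo?mode=memory&cache=shared", uri=True)
b = sqlite3.connect("file:demo?mode=memory&cache=shared", uri=True)

a.execute("CREATE TABLE t (x INTEGER)")
a.execute("INSERT INTO t VALUES (42)")
a.commit()

rows = b.execute("SELECT x FROM t").fetchall()
print(rows)  # [(42,)]
```

The same sharing is what makes concurrent writers on separate connections contend with each other, as noted above.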
By the way, with exclusive mode and shared cache, we might no longer need to wait for the writer, since only one process uses the db and it doesn't require the file system or a mutex to communicate the db's state with other processes or other connections.
I'm not sure how writes can fail to be serialized; I think WAL basically does that for us. It was an issue before, since multiple processes could write to the db, but I believe that is no longer possible?
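For reference, enabling WAL is a one-line pragma, observable with Python's `sqlite3` (illustration only; the path is a throwaway temp file):

```python
import os
import sqlite3
import tempfile

# WAL journaling serializes writes inside SQLite itself while allowing
# readers to proceed concurrently with a writer on the same file.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # "wal"
```

Note that WAL requires a file-backed database; in-memory databases stay in `memory` journal mode.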
Anyway, okay, if other people are okay with not having an upper limit.
The problem was simply that it was effectively hundreds of thousands of transactions. Each write by some component into the persistence layer needed the full cost of being handled by sqlite properly. A single transaction that inserts 10k rows is way better than 10k transactions that insert one row each. Even with WAL that remains true. If there's a way to configure sqlite to make it the same, then by all means switch :) But I could find nothing that would give the same perf. -- To repeat: I would love to remove the buffering layer :) It's just added complexity for a perf win. If we can get the perf win without it, be my guest :D
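The batching point above can be demonstrated with any SQLite binding; a Python sketch (illustrative only, using the stdlib `sqlite3` module):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

# Insert 10k rows inside ONE transaction: a single commit at the end,
# instead of paying full per-transaction overhead 10,000 times.
with conn:  # the context manager wraps the block in one transaction
    conn.executemany("INSERT INTO t VALUES (?)", ((i,) for i in range(10_000)))

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 10000
```

Committing once per row would force SQLite to sync the journal for each statement, which is the per-transaction cost the buffering layer avoids.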
I searched a bit, and it looks like what you did is the recommendation. :)
Force-pushed (… dotnet/dev15.4.x) from 3db065c to 6c66464
@heejaechang Rebased this on dev15.4.x, and included #21532. @jasonmalinowski This should be the first PR we were after.
I think we should bring this in too (#22340)
@heejaechang After this is merged, we will rebase #22340 and bring it in as a separate change.
A word of caution. It appears that the problems I encountered with #22340 are related not to exclusive locking but to shared cache. With shared cache enabled, VS sometimes crashes (stack trace below), with SQLite returning
@onyxmaster Sounds like I need to bring #21800 back as well, and then update it to treat BUSY and LOCKED the same way.
Force-pushed (…to dotnet/dev15.4.x; "Be resilient to getting busy messages back from sqlite while retrieving data") from 6c66464 to 8c63dc6
This is now updated to include #21800. I then modified the two locations which handle SQLITE_BUSY to also handle SQLITE_LOCKED as a second return value with the same impact on our code.
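The busy/locked handling described here can be sketched with Python's `sqlite3`, where both conditions surface as `OperationalError`. The `execute_with_retry` helper below is a hypothetical illustration of treating the two conditions the same way, not the Roslyn implementation:

```python
import os
import sqlite3
import tempfile
import time

def execute_with_retry(conn, sql, params=(), retries=5, delay=0.05):
    """Hypothetical helper: treat 'busy' and 'locked' errors the same way."""
    last_error = None
    for _ in range(retries):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as e:
            message = str(e)
            if "locked" not in message and "busy" not in message:
                raise  # unrelated error: don't swallow it
            last_error = e
            time.sleep(delay)
    raise last_error

path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path, timeout=0)
writer.execute("CREATE TABLE t (x INTEGER)")
writer.commit()

reader = sqlite3.connect(path, timeout=0)

writer.execute("BEGIN IMMEDIATE")  # writer holds the write lock
try:
    reader.execute("INSERT INTO t VALUES (1)")
    blocked = False
except sqlite3.OperationalError:
    blocked = True  # SQLITE_BUSY/LOCKED surfaces as OperationalError
print(blocked)  # True

reader.rollback()   # drop the reader's failed transaction (and its locks)
writer.commit()     # release the writer's lock
execute_with_retry(reader, "INSERT INTO t VALUES (1)")
reader.commit()
```

With a `timeout` of 0, the second connection fails immediately instead of invoking SQLite's built-in busy handler, which makes the contention visible.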
```csharp
    return true;
}
catch (SqlException e) when (e.Result == Result.BUSY || e.Result == Result.LOCKED)
```
I'm a little nervous about this. Do we need to make the change that causes these 'locked' errors to start happening? What gain are we getting to warrant it?
My specific concern is that this now means that many more SQL calls may have to be protected from locked errors, where we didn't have to worry about that before. There may be more places that are missing this protection and may end up leading to actual application errors in the wild.
My understanding is that BUSY and LOCKED are similar, and the one getting reported depends on the current connection configuration. The original configuration produced BUSY errors (which we saw in Watson), but the change to use shared and exclusive locks (required for follow-up work) changed a subset of those cases to return LOCKED instead.
@CyrusNajmabadi I'd like to note that `busy` states are almost the same as `locked`, so the argument about unknown places in code that may fail is relevant to `busy` as well.
Update: didn't see Sam's comment before posting, I agree with it.
If they are effectively 1:1, then I'm OK with things. My concern was that 'locked' might show up in different places where 'busy' would not have. If you guys are convinced that that's not an issue, then I'm definitely OK with this. Thanks!
Minor note: #22340 no longer proposes the use of exclusive locking (and the dependence on shared cache), so at least part of the discussion might need reconsideration.
Approved to merge via 507562 |
Fixes #22650
Rather than place a hard cap on the number of connections, I updated the code to ensure database connections are only held for the duration where they are actually in use.
CC: @dotnet/roslyn-ide @heejaechang @CyrusNajmabadi @onyxmaster
Customer scenario
A customer opens a large solution with 15.3 or 15.4. The IDE is at high risk for encountering OOM errors.
Bugs this fixes:
#22650
DevDiv 507560
Workarounds, if any
None
Risk
Relatively low.
Performance impact
For scenarios where database writes to disk are slow, performance is improved by allowing database connection sharing.
Is this a regression from a previous update?
Yes.
Root cause analysis:
As a best guess: treat `SqlConnection` and `sqlite3` objects as heavier to account for this, similar to what we already did for `MemoryMappedView`.
How was the bug found?
The 425600 byte allocation pattern was discovered by Lee Culver on the .NET Fundamentals team. Text contents in some of these blocks made me suspicious of SQLite, leading me to search for "425600 SQLite". The value appeared twice in interesting results:
https://forums.plex.tv/discussion/256008/pms-crashes-randomly-under-load
http://iamoracleadmin.blogspot.com/p/sqlite-cheat-sheet.html
I then used PerfView to review the number of connections present in the managed heap, and found it exactly matched the number of native allocations.
Test documentation updated?
N/A