Sunday, February 06, 2011

When SQL IS Broken

I think it was Mr. Benedict in "Last Action Hero" who said:

"Eliminate all logical solutions to a problem - and the illogical becomes invariably true."

I received a distress call the other day concerning a legacy application locking up a SqlServer 2000 database, together with some Profiler trace files for analysis.

Looking at the traces, I indeed saw how certain statements were blocked for minutes, until other connections committed their transactions. Nothing extraordinary, one might think: of course uncommitted data updates lead to data locks, which block ensuing statements that try to read or write the same data. Maybe those transactions were simply taking too long. But here comes the catch: they did not contain any Inserts/Updates/Deletes, just Selects (the only reason for the transaction's existence being that a programmer some fifteen years ago had decided to activate implicit transactions). And with isolation level "Read Committed", and those queries lasting only a few milliseconds, there should not have been any long-lived locks.
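For illustration, here is roughly what the legacy connection's setup amounted to (a minimal sketch; the table name is invented):

    -- Every statement implicitly opens a transaction that stays open
    -- until an explicit COMMIT or ROLLBACK is issued.
    SET IMPLICIT_TRANSACTIONS ON;
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

    SELECT * FROM Orders WHERE OrderId = 42;  -- hypothetical table
    -- No COMMIT follows, so the transaction silently stays open. Under
    -- Read Committed, though, the Select's shared locks should be released
    -- as soon as the statement completes.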

Unfortunately, SqlServer 2000 tracing does not support "Blocked process report" events. This trace event was only introduced in SqlServer 2005, and it shows in detail which SQL statements block which other statements (you can find it in the "Errors and Warnings" event category; hint: don't forget to activate 'show advanced options' and set an appropriate 'blocked process threshold' via sp_configure). Of course one can always have a look at the Management Studio Activity Monitor, but the statement information there is not as fine-grained (all it displays is the last statement within a batch, not necessarily the one causing the blocking). And I hadn't received any Activity Monitor data for my analysis.
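For reference, enabling that report on SqlServer 2005 or later looks like this (the five-second threshold is just an example value):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    -- Report any block lasting longer than 5 seconds (0 disables the report).
    EXEC sp_configure 'blocked process threshold', 5;
    RECONFIGURE;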

So when I took a closer look at the trace file, all I could do was scan through all of those Lock Acquired / Lock Released events. This is a really tedious task, not only because there might be hundreds of locks being acquired and released at any scope from row to table, but most of all because you only see which locks get acquired, not which locks fail to be acquired (and hence lead to blocking).
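One way to take at least some of the pain out of that scanning is to load the trace file into a table and filter on the lock events; a sketch (the file path is made up):

    -- Event class 24 = Lock:Acquired, 23 = Lock:Released.
    SELECT StartTime, SPID, EventClass, TextData, ObjectID
    FROM ::fn_trace_gettable('C:\traces\blocking.trc', DEFAULT)
    WHERE EventClass IN (23, 24)
    ORDER BY StartTime;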

The statement being blocked turned out to be a single-row update. I saw a key lock and a page lock being acquired, and then things stood still. So what I did was look for an identical statement within the trace, but one which had not been blocked. Luckily, I found one, and on closer examination I noticed that the next lock acquisition would have been a page-level lock conversion from "Intent Update (IU)" to "Intent Exclusive (IX)". That's not surprising either, just standard procedure when a row is about to be updated.

The only thing that collides with such a lock conversion would have been an existing "Shared (S)" lock on the same page (or on a row within that page). But how could there be a shared lock when no other query accessing the same data was currently running (taking into consideration isolation level "Read Committed")? True, some minutes earlier a query accessing the same table had been executed, and its transaction was still open (due to implicit transactions), but that statement's shared locks should have been gone by then. Granted, it didn't make a lot of sense to use a transaction for a read-only operation (again: isolation level "Read Committed", not "Repeatable Read"), and applying a client-side cursor was also kind of weird. But the cursor had been closed at this point, so its locks should have been gone. Legacy applications sometimes act weird, and we should still be able to expect proper database locking.
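That expectation is easy to verify with sp_lock: run a Select inside an open implicit transaction, then list the connection's locks (again a sketch with an invented table name):

    SET IMPLICIT_TRANSACTIONS ON;
    SELECT * FROM Orders WHERE CustomerId = 7;  -- implicitly opens a transaction
    -- The transaction is still open, but under Read Committed the Select's
    -- shared locks should already have been released: expect no S entries
    -- for the table's pages or rows in this output.
    EXEC sp_lock @@SPID;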

I built a repro testcase, and sure enough could reproduce it on SqlServer 2000: there were several shared locks remaining on the query's connection, even once the query's cursor had been closed. In comparison, running the same testcase on SqlServer 2005 resulted in no dangling IS locks, hence no blocking. Well, it's about time to upgrade ten-year-old database systems anyway. You see, "SQL(Server 2000) WAS broken" in this case.
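Condensed, the repro had roughly this shape (the original application used a client-side ADO cursor; the server-side cursor below is only a stand-in, and all names are invented):

    -- Connection 1: legacy-style reader.
    SET IMPLICIT_TRANSACTIONS ON;
    DECLARE legacy_cursor CURSOR FOR SELECT * FROM Orders;
    OPEN legacy_cursor;             -- opens an implicit transaction
    FETCH NEXT FROM legacy_cursor;
    CLOSE legacy_cursor;
    DEALLOCATE legacy_cursor;
    -- On SqlServer 2000 shared locks linger here; on 2005 they are gone.

    -- Connection 2: single-row update on the same table.
    UPDATE Orders SET Status = 'shipped' WHERE OrderId = 42;
    -- Blocks on SqlServer 2000 until connection 1 finally commits;
    -- completes immediately on SqlServer 2005.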