Thursday, September 17, 2009

Data Loads

One of the things I really like about SQL Server Integration Services is that it can load large data sets quickly, especially when you need to look up values along the way.  However, there are times when things behave in strange ways that are hard to explain.  We were recently trying to load about 64 million rows into a warehouse-type database and kept hitting a bottleneck once we got past around 30-35 million rows.  No matter what I tried, the load would start at about 2 million rows per minute and then gradually slow until it was crawling along at around 10,000 rows per second (roughly 600,000 rows per minute).  I found several helpful posts, including some pointers to check the data flow buffer settings (DefaultBufferMaxRows, DefaultBufferSize, and similar).  Tweaking those had helped someone on a memory-bound machine with a very similar problem.  I tried it and saw my initial load speed improve, but the load still slowed to a crawl at around the same point.

I learned more about looping through (aka "shredding") a recordset of CustomerID values so I could work through the set in pieces.  I set up a For Loop to process 5 million records at a time, and I checked for other processes running against the table.  Nothing seemed to help.  I had posted information on Twitter, had a co-worker looking at the package with me, and even got some replies from Paul Randal.  (Thank you for the assist, Paul - sorry I wasn't checking my tweets more regularly at the time.)
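For anyone curious, here is a rough sketch of the kind of query that could produce that recordset of CustomerID batch boundaries for the loop to work through - the table and column names (dbo.SourceOrders, CustomerID) and the exact batching approach are placeholders for illustration, not the real schema from this load:

    -- Build a recordset of CustomerID ranges, roughly 5 million rows per batch,
    -- for the loop to shred and process one batch at a time.
    SELECT BatchNo,
           MIN(CustomerID) AS RangeStart,
           MAX(CustomerID) AS RangeEnd
    FROM (SELECT CustomerID,
                 (ROW_NUMBER() OVER (ORDER BY CustomerID) - 1) / 5000000 AS BatchNo
          FROM dbo.SourceOrders) AS numbered
    GROUP BY BatchNo
    ORDER BY BatchNo;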

Along the way, I struggled to set the CustomerID range dynamically in my source query and eventually resorted to treating the SQL command as a property expression and hacking my variables into the WHERE clause.  It worked, but it definitely felt awkward, and I'm more than open to suggestions on better ways to use SSIS variables inside a For Loop as part of the OLE DB Source command.  I also learned a little more about watching variable values - like needing to have a breakpoint set before they'll show up.  And my co-worker pointed me to a wonderful discussion showing that DtExecUI runs only in 32-bit mode.  I'd been using that as a quick and easy way to plug in my parameters and keep an eye on the general progress of the load.
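To show what I mean by the hack, here is roughly what a property expression on the data flow's [OLE DB Source].[SqlCommand] property can look like - the variable names (User::RangeStart, User::RangeEnd) and the table name are made up for this sketch:

    "SELECT CustomerID, OrderDate, Amount FROM dbo.SourceOrders WHERE CustomerID >= "
    + (DT_WSTR, 20) @[User::RangeStart]
    + " AND CustomerID < "
    + (DT_WSTR, 20) @[User::RangeEnd]

The cleaner route is usually a parameterized source query (? placeholders mapped to the variables), but since that was exactly what was giving me trouble, the expression approach at least got the loop running.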

About the time Paul suggested checking my index fragmentation levels, I remembered seeing a very similar behavior pattern several years ago, when a co-worker tried to update a column in the middle of a somewhat wide clustered index on a multi-million-row table. The server thrashed all over the place, trying to re-order millions of rows as that command ran.  I checked my table and, sure enough, there was a wide clustered index on it that would cause exactly this behavior.  I don't know why I hadn't checked it earlier, other than assuming that I'd just created a plain test table with no indexes on it.  In retrospect, that was poor planning on my part.  In the future, I'll remember to check for clustered indexes before I try to load millions of (unsorted) rows into a new table.
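If it helps anyone else avoid the same detour, this is roughly the check I should have run up front (dbo.LoadTarget is a placeholder for the real target table):

    -- List the indexes on the target table and their fragmentation,
    -- which also makes any clustered index hard to miss.
    SELECT i.name,
           i.type_desc,
           ps.avg_fragmentation_in_percent,
           ps.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.LoadTarget'),
                                         NULL, NULL, 'LIMITED') AS ps
    JOIN sys.indexes AS i
      ON i.object_id = ps.object_id
     AND i.index_id  = ps.index_id;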

It was a somewhat humbling experience, mostly because one of the basic things I know about loading lots of data was exactly what I ignored throughout my troubleshooting. Still, a little humility is a good thing, and I re-learned several techniques in the process. The best part was seeing a 6-hour load process reduced to about 30 minutes, including the time to drop and re-create the clustered index. That's going to make a noticeable difference in our nightly processes.
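For completeness, the pattern that got us there is nothing fancy - drop the clustered index, run the load, then put the index back.  The index, table, and key column names below are placeholders, not our real ones:

    -- Drop the clustered index so the rows can land without constant re-ordering...
    DROP INDEX IX_LoadTarget_Customer ON dbo.LoadTarget;

    -- ... run the SSIS load of all 64 million rows here ...

    -- ... then re-create the clustered index once the data is in.
    CREATE CLUSTERED INDEX IX_LoadTarget_Customer
        ON dbo.LoadTarget (CustomerID);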
