2009-03-31

Tempdb Monitor

Rajasekhar Reddy asked:
Could you please let me know the best way to get alarms before tempdb gets full ?
A method of dealing with tempdb filling up, and analysing the causes (thus preventing them in the future) is provided here. This is a second generation method, useful to all ASE servers, with unknown, untamed tempdb content; the first generation rendition is quite adequate for tamed and controlled databases. This has been implemented and tested in large Production systems. The approach is important, and different from other approaches that have been evaluated, in two requirements:
[bxA]- monitoring and analysis of the conditions that lead to tempdb full (rather than merely reacting to a full tempdb)
- administering Sybase from external scripts (this should be quite ordinary for test and production systems; scripting for Sybase is not discussed here)

Cause & Prevention

Handling the symptom without understanding and addressing the cause is a never-ending story of fighting fires, quite unfulfilling. The two common causes and prevention are identified here, so that the firebugs has be hunted down. tempdb is the busiest database on any Sybase server. Any effort that is made to affect these two areas will substantially reduce tempdb usage, and therefore affect overall server performance.

1 A large tempdb, and the event of it filling up, is caused by the over-use of temp tables; particularly re-creating large result sets as an intermediate step in processing. As the main database grows, tempdb has to grow as well. This is caused by the lack of Normalisation; in fact a large and active tempdb is a classic hallmark of this condition.

2 For Normalised databases, the second cause is poor SQL coding that fails to take advantage of the power of Relational paradigm (joins, subqueries, set-processing, conditionals), and instead reverts to creating intermediate worktables and uses procedural processing (row-processing; cursors or cursor substitutes). That is, the coder has a Relational Database but has coded in non-relational style.

User tempdbs provide a containment factor, however, if the cause is not addressed, these simply spread the pain and present additional administrative tasks to the DBA.

Synopsis & Method

Two thresholds are identified:

1 A Report Threshold for space used by an object in tempdb, below which the object is of no concern, and above which the object is reported (eliminates the multitude of small objects from the report)

2 An Alarm Threshold, for space used by an object in tempdb, which when exceeded, triggers an alarm (e.g. mail being sent to the DBAs).

Both thresholds are expressed as a percentage of the allocated tempdb default segment space, with set defaults of e.g. 2% for objects worth reporting and 20% for objects which cause alarm. If tempdb default and logsegments are not separated, then the percentages should be slightly lower, to compensate for that.

The method consists of three components:

1 A table tempdb..sg_object (not a hash table) which holds a series of snapshots of objects which exceed the Report Threshold. This is placed in model, so that it is created in tempdb on server boot.

2 A stored proc sp_helptempdb_sg which can be executed from anywhere, on any [user] tempdb:
- both Report and Alarm Thresholds have defaults (e.g 1% and 20% respectively) which can be overridden with execution parameters
- identify the objects residing in tempdb, from sysobjects and sysindexes, which are above the Report Threshold
- store them in tempdb..sg_object
- report such objects for the current execution [unless executing Silently]
- report "ALARM" (an additional column) for objects that exceed the Alarm Threshold

3 A shell script sg_monitor_tempdb, within the structure for normal DBA scripts:
- invoked by crontab every minute or every five minutes
- execute sp_helptempdb_sg
- capture the output_report
- grep the output_report. If "ALARM" is found:
--- mail output_report to admin_mail_list (e.g. same list as for "dump failed" condition)
--- [if login exists in a kill_list, kill spid]

Analysis of Cause

Note that tempdb..sg_object remains for inspection of specific causes, via ordinary simple SQL, logins, objects, and applications can be readily identified. Over time, patterns and trends can be analysed. It is simple enough to connect a spreadsheet application to it and analyse the content. This provides the basis for identification and thus prevention of the cause of large objects in tempdb; the ability to approach the culprits with evidence; charge-back for resource usage on a user-pays basis; etc.

The table may be filtered and archived into a permanent table in a user database, where longer term trend analysis is relevant.

Related

Always set sp_dboption 'abort tran on log full' for [user] tempdbs. This eliminates the condition, where many users are hung in log suspend, and waiting for the DBA to clear it; they get aborted and cleared right away.

Considerations

1 User tempdbs may be monitored as well. For sanity reasons, all reports for all tempdbs monitored are placed in the main tempdb.

2 Once the structure is running, the thresholds have been modulated to suit the particular site and tempdb use, and confidence has been established, the enhancement to kill the relevant spid may be implemented.

3 Using resource limits is possible in some but not all circumstances, but the facility itself is limited, and therefore I do not provide recommendations for it.

4 MDA tables could be used. However there are many problems with them, not the least of which is inability to join tables and substantial overhead (as per documentation). In order to use them effectively, one must set up a capture-and-store mechanism. They are thus not worthy of consideration for code that must execute correctly under duress.

Limitations

1 This method captures all forms of tempdb objects except internal worktables (there is currently no method of identifying worktables in Sybase). In any case, worktables are proportionate to temp tables, thus considerate modulation of the thresholds will cover worktables. (Some might suggest that MDA tables provide information re worktables, but that is false, only the number of tempdb tables and worktables per process is provided.)


2 This method does not capture the condition where cartesian products which exceed tempdb size within any single iteration of the monitoring script (ie. one minute). Note that cartesian products s are unintended, and controlled by other means; they are not the same as large intermediate result sets that keep growing.

First Generation

In order to report any segment full, and hopefully do something meaningful if it is logsegment, sp_thresholdaction has to exist in any case. In the first implementation of this method, I modified the threshold proc, to populate tempdb..sg_object, if it were triggered for tempdb default segment at 70%, 80%, 90% and 95% full. That worked well for well-behaved databases and servers, where tempdb is not running riot, where the main issue is spikes in tempdb usage. Those events are usually the result of new code (by new coders, refer Cause [2]), or report tools running without configuration limits. That method stores a snapshot at those specific trigger events only, and is recommended for standard-compliant, quality databases.

For sub-standard databases and anti-social applications, where tempdb does run riot, something more, in terms of capturing the build-up to tempdb full, is required.

Code

Software Gems provides all relevant code for all objects identified here, at a commercial quality level, as a fully integrated part of a Sybase Server & Database Administration suite. That includes housekeeping tasks suh as trimming and archiving sg_object. The code required uses ordinary Sybase ASE facilities, it is simple enough to write (ask questions), and it is presented here for those who wish to write their own. The superiority of this method over others, lies in the approach.

Copyright © 2009 Software Gems Pty Ltd

No comments: