2009-03-16

'Reverse' Key (Relational Key) - John Flynn

John Flynn wrote:
It's a known performance problem in Sybase when your primary key has monotonically increasing values and you assign a clustered index to it. Apparently all row inserts go to the hotspot at the end of the table. I guess this is true whether your key is an identity-type thing or a natural key.
Not quite. A Natural or Relational Key will not be a monotonically increasing value. A monotonically increasing value (system or code generated) is a Surrogate key, by definition (the meaning of the word), it is the key that you have when you do not have a Key.
[bxA]
1 Apart from responding to your question below, the best advice I can give, after many years experience and many assignments is, there is no substitute for a Relational Key:

1.1 Remember, the Surrogate Key is an additional column, and an additional index. Therefore a guaranteed performance hit for many reasons, in addition the above.

1.2 work harder to identify a relational key that remains logical. Even a contrived key (eg. removal of vowels and double consonants from a name), producing a short name code, is far better than a surrogate.

1.3 See if you can contrive something, even something not "logical" such as CustomerName or State plus SmallDatetime.

2 An Identity is a Surrogate, but a Surrogate need not be Identity.

2.1 Many people use a NextSequentialKey table, which contains one row pre real table, containing the last used key, to generate the next key for each table. Although the Surrogate is a substitute for a Natural Key, this does not break the Logical Rule. This method (or other similar methods) is superior to Identity, in many ways.

2.2 Identity is physical, it breaks The Logical Rule. One of the main issues with an Identity is, the user is not aware of the columns when they insert the row. The code has to insert it, then read it back. Unless performed correctly, duplicate rows are easy to create. The whole idea is to ensure the row does not exist: this requires the clear determination of the rows by using only the columns that the user knows. In which case, the Identity is simply not required.

3 Never use a monotonically increasing value (Identity, next-sequential, datetime) for the APL Clustered index. This is easy to overcome: simple use the monotonically increasing value for an Unique Nonclustered index, and use a column that distributed the data for the APL CI.

(By the way, we have a technique that avoids all these issues altogether and guarantees an unique logical Relational Key, but it is commercial and thus I cannot disclose it. The combination of this and a correct Clustered Index distributes the data beautifully, with or without partitions.)
Anyway, in the specific case when you're using a surrogate key like identity, it occurred to me that isn't there a simple concept to let the values spread near-randomly across the possible range of values? That would be to take the data (depending on its particular binary data format) and "reverse" the bits or bytes? Conceptually, suppose you had a monotonically increasing key that was a three-byte numeric string, so the values went like 001, 002, 003, etc., up to 999. What if each time you generated a key, you reversed the bytes, so the sequence written into the database is 100, 200, 300, etc.? Those keys would be very well distributed across the whole range, yet they'd also be guaranteed unique because they come from the same generating system as before. Thus would that key not become appropriate for a clustered index? The same concept could apply to other datatypes, like a 32-bit binary integer, just reverse all the 32 bits. It's still an integer, but now it naturally distributes itself across the whole range.

This seems so simple, and something Sybase could have built into ASE, like as a new flavor of identity thingy. The fact that they have not done so leads me to think that there must be something wrong with the idea. What is wrong with the idea? I know that if your primary key is a natural key then this concept doesn't help. But what if it's a surrogate key?
Great minds think alike. A certain person (who shall remain nameless) has been trying to influence Sybase to do exactly that for some time. The progress is as follows. First a few items to tighten up the design, which may demonstrate its maturity.

1 Your idea is great but your example is not (a pattern forms after a while, and “random” distribution is not achieved), but that is easily overcome by a well known technique called Hashing the Key, common to non-relational systems, and in a different class of performance, where each of the bytes are hashed separately and then "added" together, which guarantees even distribution. Give the system a key and it will immediately return its computed location.

2 This technique has one limitation: the range of distribution must remain fixed as it is a component in the hashing algorithm. That is not such a big deal in that (a) one must use a value that one expects never to exceed (b) in the event that it is exceeded, simply unload/reload the table. The single read degrades to 1.1 when the table is full.

3 However, the physical space must remain fixed, and reserved in a new way, unlike any current space-allocation-for-table method.

4 It also increases the speed of single-row access phenomenally: the computational hash method eliminates the index altogether (eg. avoid working through a B-Tree height of 7 levels), but you do need the index for other purposes.

5 Upon submission of Hashed Key Tables to Sybase Engineering (not via the hilarious "ISUG Enhancement Request Process" with all of 12 voting members), the person was informed that Sybase had already thought of that; implemented it; that it should be tested and feedback was invited. It was called Virtually Hashed Tables.

6 Further investigation revealed that it was only available on one platform: IBM P-Series Linux, both 12.5.4 and ASE 15.0.2. There is no information re availability on other platforms (there was some talk, but no evidence, that Sybase was planning on releasing it on all platforms; in the enhancement game that means no plans).

7 I do not have that equipment, and thus I have not tested it, but I reviewed the doco carefully and exchanged correspondence on the subject.

8 The implementation has a gross limitation as I have reviewed it: each VHT requires its own private segment. With the 29-segment limit, that is a joke. I can understand the need for a discrete segment, much like logsegment, for any/all VHTs only, and I have no complaint about that, but I baulk at one segment per VHT. The very people who want VHT (performance types) would be using segments, and they can ill-afford to lose one segment, never mid one segment per table.

9 That has been the situation unchanged for two years.

10 The Sybase ASE product manager is not identified, and that is the only person I have not pressed this issue with.

11 Separate to that, I have lodged a public ER through the identified process, to eliminate the 29 segment limit. That is required for other reasons, such as providing the level of high performance available to small and medium-sized databases, to VLDB. The reason I am mentioning it is, this will eliminate [8] as a problem, and deem VHT as useable.

Since the ISUG process is a joke, I would not recommend that you spend money just to vote on the established ERs or push VHT; I recommend you write to Sybase formally, and keep going up the chain.

Finally ...

Back to your general idea, and Rob’s. I recommend against it, because there is a translation barrier: the user does not know the real key (this disallows access to database via tools other than the app; one of the classic results of breaking the Logical Rule) and it has to be translated back and forth at every interaction step. The main point here, again, is to emphasise the value of the Twelve Rules, and to use a logical relational key that is meaningful to the user. For which there is no substitute.

Separately, Rob’s idea of inserting into a surrogate (that word again) Identity table just to obtain the Identity value, while reducing some of the side effects of Identity columns, poses an active contradiction to performance.

Cheers
Derek
Senior Sybase DBA / Information Architect
Copyright © 2009 Software Gems Pty Ltd

No comments: