2009-03-15

Data Distribution - Mario Ascione

Distribution of data over different disks for better disk utilization

Without posting the entire thread, which has a somewhat different context, I have picked it up at the point where the unanswered part starts. Mario Ascione posted:

The table has a unique clustered index on the primary key "test_id" [identity] with lock datarows and some non-clustered index.
That is not a relational key, it is a physical, non-relational key which will cause a problem everywhere it is used. Let's look at one instance ...
[bxA]
Filling the table happens what [person] said (first reply):
"Note: Once you place a clustered index on your roundrobin partitioned table, future inserts will be placed sequentially (not randomly) on the various segments on which the partitions reside; in other words, the partitions will become unevenly populated with new inserts until the clustered index is re-created (which will equally redistribute the data, in index order, over the partitions/segments)."
I do not agree with the person's wording, but in the main ... with a view to continuing this post without going back and forth ... ok.

So only the first partition is populated while the others remain empty.
Well, that is what you have told ASE to do. The "clustered" index tells ASE to PLACE the rows in that physical order [it doesn't really but that is a different point, outside the scope of your problem]. The partitioning tells ASE to use RR, to place the rows in the heap (not the "clustered" index) in the four partitions in RR sequence. The Identity "key" is always going to be next-sequential: that means the "clustered" index remains a B-tree with true next-sequential leaf ordering (the space usage is probably not visible to use); but the data rows are RR across 4 partitions. BUT you ran your test from one session, one spid: that does not provide RR; the RR is for each sequential insert from DIFFERENT spids; all inserts from one spid go into one partition.

Basically you have a few contradictions, and ASE it doing "what you told it to do".

There is another issue, it is hard to learn by yourself, making mistakes and correcting them one by one. It is easier to learn from examining good implementations.

I have to drop and recreate the clustered index in order to distribute data evently over the four partitions.
Yes. In this type of situation only. In other situations, definitely not.

AND this is invoking the characteristics of the "clustered" index. But right afterward, as you insert new rows, it goes back to uneven distribution.

Some questions:
How can I avoid drop/create clustered index?
You can't in this case. If the test was more real (multiple inserting spids) the data would be distributed ... but that conflicts with the CI definition you have given it, so "suck it and see".

I've read that It's wrong use CI on a datarows table.
It is not "wrong" to use a CI on DPL/DRL. First, the CI is not a CI ala APL, so do not expect that operation. Second it is PLACEment only; it remains a separate detached B-Tree, but on the same segment as the data heap. Third an Identity or next-sequential number or datetime is the wrong key for a CI. Basically, 90% of the time, people choose the wrong key for the CI on DPL/DRL, and they have incorrect ideas re placement/distribution. Therefore I suggest not having a CI on DPL/DRL, just have NCIs, you lose nothing except the confusion.
Should I use a non-clustered index instead of clustered one?
In this case, that would be a simple way of overcoming the confusion and getting what you intend. Drop the CI; create an NCI on test_id; run your inserts from four spids, and you will get even distribution of data across the partitions.

Thanks Derek for the clarification.....

just one doubt:
what do you mean with "four spids"?
number of worker process? I suppose not (the worker process is an ASE internal thread; I've configured 12 wp).
multiple threads ("session") performing insert operation? yes means that partition table is good "only" in multithread environment.
Others?
Correct. Worker processes are internal to ASE and support parallelism, I wouldn't express it as "only good for multi-threaded environments", the server is beautifully multi-threaded; parallelism needs a multi-processor environment. Each spid/connection will have many WPs.

By "four spids" I mean four sessions, four separate invocations of isql, which will simulate four separate user/client connections in a live system. The inserts of each separate spid/connection will "Round Robin" across the heap partitions in a DPL/DRL table.

1 comment:

quelli della 5F said...

Hi all,

just for close the issue; I've done a test:
dropping the clustered index on a partitioned table and leaving the table only with non-clustered index, data insertion is distributed evenly across partition.

Thanks
Mario