2009-03-18

Normalisation Workshop 1 - Niksa & Others


Value

This post is best described by its title. It is not a top-down formal education in the subject. It has substantial value because it deals with several problems that are common in the real world, and that damage the quality and performance of Relational Databases. It provides a method for delivering a high quality, high performance database, without breaching Standards and Rules.

There are several elements in the workshop that are of value to Developers (more) and Data Modellers (a bit less). Its value is in the example, so do not get too caught up in the edges and the tangents. The sequence of exchanges is worth following because it is virtually a script, based on common myths, and it deals with the blocks to learning. Depending on how carefully you inspect each step, there are many small but important items to be noted, in the context of the whole operating Relational Database, and of resolving each new requirement between the Modeller, the Database, and the Developer, as it arises.

Warning: Since this thread has the back-and-forth nature of a discussion, a workshop, and it goes into detail, it will be long. I will keep it trimmed and up to date, as the comments are integrated into the post. When the discussion is closed, I will publish a concise and abbreviated Revision for posterity.
Chronology
1 This table was submitted to me for implementation in the database

create table Transaction (
    CustomerId   int,       -- PK.1
    Date         date,      -- PK.2
    SequenceNo   smallint,  -- PK.3
    DebitAmount  money,
    CreditAmount money
    )

Along with a report that was required by the user:
result.pdf

It is an excellent example of the common problem I have dealt with hundreds of times, on scores of full implementations, in my role as Modeller/DBA working with a Development team, which is why I am using it in this workshop.

This “table” is easily recognised as a brilliant scratchpad to supply a single user’s requirement. Unacceptable for the existing Relational Database. Unnormalised: two facts in one place; no 1::1 ratio for PK::Attribute. The code required to deal with it would be as ugly as sin. If implemented, it would form a hard, fixed structure, an island in a standards-compliant sea of database, and it would require special maintenance. (Ask questions if you do not understand any of these points.)

Context
A In order to avoid confusion of roles, at least in this thread, I am the Modeller/DBA, and the people requesting (a) assistance and (b) that their objects be accepted and placed in the database, are Developers.

B I administer Relational Databases, not blobs or data heaps, to Relational and design/implementation standards. That means they are fully Normalised; all objects are integrated with all other objects; Declarative Referential Integrity is defined and maintained by the server. This provides the highest level of performance and Relational power.
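
For readers who have not met the term: Declarative Referential Integrity means the server itself, not application code, enforces the relationships between tables. A minimal sketch, using hypothetical object names (not the actual Database objects discussed in this thread):

-- Hypothetical illustration only: once declared, the server rejects
-- any transaction row that does not reference an existing customer,
-- regardless of which program attempts the insert
alter table CustomerTransaction
    add constraint CustomerTransaction_Customer_FK
    foreign key (CustomerId)
    references Customer (CustomerId)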

C Developers need to understand the application problem and the data, and to develop prototype code, perhaps using a scratchpad. As DBA, I can guarantee you that you can do what you want to do in the ANSI SQL RDBMS that I am administering. But it may not be done the way it was done on the scratchpad. Both parties have to conform to standards, so that we all speak the same language, and adhere to the same goals and the same rules for achieving them. These standards are not "mine"; they were created by nationally recognised standards bodies for the express purpose of nullifying arguments. In order to do a professional job, we must understand them and apply them:
- Structured Query Language (IBM, 1970s)
- the SQL Standard, American National Standards Institute (ANSI)
- Relational Database Management Systems, and Normalisation (which pre-dates SQL), E F Codd & C J Date
- And finally an implementation platform such as Sybase ASE, with its extensions

D As Modeller/DBA, guardian of the Relational Database is my identified job description in the organisation: to ensure that all users' needs are met, within budget and hardware/software constraints. This means I do not allow new requirements which contradict existing, implemented requirements; and I keep the organisation's long-term goals (the Data Model) clearly in mind. Developers have less power than users: there is no way some developer's unnormalised, single-user, fast-in-the-isolation-of-their-mind structure will be added to the Relational Database.

E We need to put aside all pre-conceived notions (fixed ideas) about what is "slow" or "fast", etc., and find out honestly what is slow and what is fast. What may have been slow or fast on the scratchpad, in isolation, is quite different in the context of a real Relational Database running live with hundreds of users. In order for the learning process to progress as quickly as possible, if at all, it is best that the Developer understands the Normalised article, and codes for that (as opposed to demanding that the unnormalised scratchpad be placed in the database, or squealing that the code will be "slow"). Essentially, we need to maintain the identified roles. Slow or fast is irrelevant in isolation; we are working with a corporate Database which is used by the entire organisation, and slow or fast will be honestly determined in that context before the end of the exercise.

What is required here uses simple SQL and Sybase, on Relational tables. No "clever" code, with its maintenance problems (ensuring a job for life for the, um, developer); no "new features"; no blaming others ("we can't do it until Sybase gives us what Oracle has"). What is called for is to make the best use of the facilities and the Data that exist.

Back to Chronology

2 In order for the user's requirements to be realised, since the “table” [1] was not Normalised, and since I am the Modeller/DBA, the Developer requested that the table be Normalised. So I did my job: Normalised the table, brought it up to the other standards (not discussed here), and implemented it as a Relational Table in the Database, as an integrated part of it, related to the existing tables, for use by all the users in the organisation (and not one particular user). Refer file:
DDL.pdf

(Ask questions if you do not clearly understand the change from the unnormalised submission to the Normalised table.)
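
For readers who cannot open "DDL.pdf", the following is a hypothetical sketch only, not the actual implemented DDL, of the kind of change involved. The two mutually exclusive amount columns collapse into one Amount column, qualified by a TransactionType, restoring the 1::1 PK::Attribute ratio; the names, datatypes, and keys shown here are illustrative:

-- Sketch only: the real table, columns, and keys are in DDL.pdf
create table CustomerTransaction (
    CustomerId      int      not null,  -- FK to the existing Customer table
    SequenceNo      smallint not null,
    Date            date     not null,
    TransactionType char(1)  not null,  -- 'D'ebit or 'C'redit: one fact
    Amount          money    not null,  -- one amount column, always valued
    constraint CustomerTransaction_PK
        primary key (CustomerId, SequenceNo)
    )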

2.1 In order to ensure that we come up with the intended results correctly, a set of prepared data (known values) is included in "DDL.pdf".

2.2 I have checked, to ensure that the "result.pdf" report is entirely possible; all data can be derived directly from this table, without any fuss or complexity (it may be complex on other platforms, and marvellous new features may be required there, but it is dead simple and very fast in Sybase, now that the data is Normalised and placed in a Relational Database). The code I wrote is merely for the purpose of verification, before I make such assurances to developers, users, and bosses. It is "not my job" to produce this code, and it may not have all the structure and standards for code as defined by the organisation. It is the Developer's job to provide the code for the user.

2.3 The "result.pdf" shows the correct report as per user requirement has been produced. I have included both the required output report, and the statistics, on one page in the "result.pdf". The statistics are merely to prove that the results were in fact produced, that the SQL required is possible. The statistics can be safely ignored for now, do not get caught up in the tangents, tiny code often executes in 0 ms. Discussion of this before discussing the code, would be premature. Likewise the efficiency (which code is better than which other code) is best left until after submissions have been made, and evaluated in the full context.

2.4 Since ORDER BY Date and GROUP BY Date are required to produce the derived data in the report, a worktable is demanded, and cannot be avoided. At no stage have I suggested that submissions should not use worktables. Derived data is ordinary and normal; the consequent worktables are ordinary and normal. The ability to write code that derives columns from data is normal.
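
As a sketch only (this is not my proofing code, and it assumes the hypothetical columns above): daily totals and a running balance require nothing more exotic than GROUP BY plus an ordinary correlated subquery; the worktable these produce is, as stated, ordinary and normal:

-- Sketch, on the hypothetical CustomerTransaction above:
-- daily movement per Customer, plus the balance as at each Date
select CustomerId,
       Date,
       DayTotal = sum(case TransactionType
                          when 'C' then Amount
                          else -Amount
                      end),
       Balance  = ( select sum(case t2.TransactionType
                                   when 'C' then t2.Amount
                                   else -t2.Amount
                               end)
                      from CustomerTransaction t2
                     where t2.CustomerId = t1.CustomerId
                       and t2.Date      <= t1.Date )
  from CustomerTransaction t1
 group by CustomerId, Date
 order by CustomerId, Date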

2.5 Duplicate data is abnormal. Inability to work with Databases, and to write code that produces derived columns, is abnormal. Inability to give up attachment to personal, non-corporate data is a severe impediment.

2.6 The code should rely on the data only, and expect to perform Relational functions only. It should not rely on the current Clustered Index, the physical ordering of rows, or the price of fish.

2.7 The stats produced by non-compliant code that relies on 100% duplicate data or other substandard structures, are irrelevant to this thread.

3 The ball is in the Developer's court now, to develop SQL using the existing Database objects (forget about past objects, they were rejected) to produce said "result.pdf".

3.1 Any attachment to obsolete structures and substandard code will hinder the learning process. Better to understand the tables as they are, and to develop new code. The data may not be where it was, but I can assure you the data is all there, possibly in different locations, and completely accessible using ordinary SQL. As per [E], do not worry about joins and subqueries; these are the ordinary constructs required to access, and produce Derived Data from, a Relational Database; we will make sure the joins, subqueries, and worktables are not too "costly" before the day is out.

3.2 Do avoid cursors and temp tables.

3.3 The code will be reviewed and upgraded to standard if required, so do not worry about cutting perfect code the first time. This is about learning and understanding, not about perfection or competition.

3.4 If the code-cutter cannot cut code, there are many resources (books, manuals, courses, reading the code of others), and ultimately many people from whom to ask for help. I know a Modeller who knows a thing or two about SQL, who would be happy to help. But asking for help means submitting code (working or not) first; it does not mean asking me to code from scratch (that would be doing the Developer's job for them, and nothing would be learned).

3.5 Some Developers have mental blocks that hinder them from using the existing Database tables, due to their attachment to the spreadsheet-style contents of their C: drives. This prevents them from performing their duties effectively. The thread expects to address that, as part of the learning process, and to overcome it; but that is limited by the other person's ability to change.

Go for it

Now code, and submit it. An hour or so should suffice.

Cheers
Derek
Senior Sybase DBA / Information Architect
Copyright © 2009 Software Gems Pty Ltd

Pseudo-relational is a dog's breakfast
True Relational is simpler than you think

Submission 1

Niksa:
I thought the [implemented] table was irrelevant. Sorry.
See, that is exactly the problem. The table has the basis, the seed of the Law, and you did not even notice.

The reason you did not notice is that all tables are irrelevant to you (only code is relevant to you). To you, “tables” come and go; they can be changed to fit the code. Or worse, another “table” can be added, creating a duplicate or triplicate.

In the corporate world, Data is the identified asset: it changes little if implemented properly (and gets replaced otherwise); programs and code come and go, and are cheap to replace.

If coders are going to last in the organisation, they need to learn respect for data (without giving up their respect for, and ability with, code). Real Database data is easy to change where standards are in place [a], but expensive to change where they are not [b]: it affects everyone and many code segments. That is exactly why developers often demand that their own personal or private data "table" be implemented. That leads to massive duplication and maintenance problems, which lead to complete replacement, which we have to avoid. It is better to learn to work with the corporate data; to implement smaller, correct, controlled, standard-compliant changes to the Database [which I had already done last week]; and to code for that.

a. it took me all of 10 minutes to implement the user's requirement in the existing Database, with full standards.

b. it is hard to change, and affects more people, where no standards have been implemented.

Submission 2

Niksa:
Here is the output produced by running my own code, written in the spirit of [person's] algorithm.
[link]
Does this qualify me to enter the classroom?
Almost, but unfortunately, no.

This is your best attempt so far, though. At least now you are using a Database table. But you are not using the existing Clustered Index. Refer [2.6].

If you can read and follow instructions, which everyone else has to follow, you will get in.

Oh, and post the code; it probably wipes the floor with mine.

Submission 3

Niksa:
... The SequenceNo and the Date in CustomerTransaction table may not be "in sync" ...
There is no relation between SequenceNo and Date. As a Relational standards-compliant DBA/Modeller, even if there were an implied relation, I have to state that it is not identified, determined, or supported. Your code should not rely on it. Use CustomerTransaction.Date only to drive your code. (Note that the required report is driven by CustomerTransaction.Date, anyway.)

I am open to changing it, but that is a physical, low-level aspect of the Database. As a Developer, you are required to work with the logical Database only. The physical aspects of the Database are "none of your business". If a certain query (not this one) is slow, and needs an additional index, I am happy to look at it.

(As detailed in [2.6], I will not be changing the CI for one developer).

The only relevant order in the report is the order of transaction occurrences (the Date in the Database Table).

The physical order of the rows in the Database Table is not relevant to you, should not be relevant to you, and should not be relied upon by any code segment.
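
To make the point concrete, a trivial sketch (hypothetical columns again): the first query states its ordering logically, and is therefore immune to the Clustered Index; the second silently depends on the physical row order, and will break when that order changes:

-- Logical: the ordering is declared; any index change is irrelevant
select CustomerId, Date, Amount
  from CustomerTransaction
 order by CustomerId, Date

-- Physical (do not do this): no ORDER BY, so the result order is
-- merely an accident of the current Clustered Index
select CustomerId, Date, Amount
  from CustomerTransaction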

Submission 4

Niksa:
... [not relevant] ...
A couple of Definitions, to assist everyone:

Database: a Normalised set of Tables, with Referential and Data Integrity.

Report: a projection, a view, a list of data from the Database, produced by Joins and Queries (including subqueries) using SQL. This projection is, by definition, Denormalised.

Some people get them mixed up.
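
A small illustration, using the hypothetical tables above plus an assumed Customer.Name column: the Database stores each fact once; the Report repeats Customer.Name on every result row, and that repetition (Denormalisation) exists only in the projection, never in the stored tables:

-- A Report: a denormalised projection produced at query time;
-- Name is repeated per transaction row, but stored exactly once
select c.Name,
       t.Date,
       t.Amount
  from Customer c
       inner join CustomerTransaction t
          on t.CustomerId = c.CustomerId
 order by c.Name, t.Date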

And a couple of reminders:

a. This is not an accounting class.

b. This is a Relational Database class. It uses an accounting example. Whether the example is applicable to the real world or not, is irrelevant to the class.

Submission 5

Niksa:
... [not relevant] ...
The strong belief that you know the problem, and therefore the solution, is actually the block itself. In order to learn, we must take the position that we do not know. Otherwise we sit at the back of the class, making comments about how much more we know than the professor in order to feel better, disrupting the class ... and failing. It is a well-known way of taking the focus off the problem (your inability to write the required code) and blaming someone else ("the professor is clueless about the real world; here, let me show you how they do it in the real world").

Your focus on the relative value of the exercise (the wrong place) is hindering you from performing the exercise (focus on that only). It is like the fat kid who says the wooden horse in the gymnasium, which everyone has to jump over, is taller than a real horse. Who cares; just jump over it. Everyone who wants to qualify has to.

If you write physical code, you will be breaking the Logical Rule: your code will be glued at the hip to the physical ordering in the table, which will change, and your code will break. That is one crucial reason why you have not qualified (and note that qualified people can [a] write logical code, which [b] does not break every time the physical aspects of the Database change). Physical code will be rejected, as stated from the outset.

Submission 6

Niksa:
OK. I've just corrected myself: UCLI (CustomerID, SequenceNo).
[link]
That is great, excellent progress !

a. The code produces the required result. Using the existing Database Table. Regardless of the existing Clustered Index.

b. It appears to use just one worktable, and one scan of the Database table. In that regard, it appears to be much better than my “proofing” code; but then, you are probably a better SQL coder than I am. (I did predict that you would probably wipe the floor with my code.)

c. You have submitted evidence that the required report has been produced, but not the code.

d. The problem with your submission, however, is that you identify and emblazon your work as “Unnormalised”. By Definition (your initial request for the new “table” to be implemented; which was determined to be Unnormalised; which means you do not understand Normalisation; this is a Normalisation class), and by Role (I am the Modeller; you are the Developer; you are trying to obtain qualification), you are not qualified to make that statement.

e. By making statements which are contradictory to the exact lesson, you have sabotaged your own submission. In order to qualify (for the class, or in the context of a Developer working with a Modeller and a Relational Database), you will need to refrain from making unqualified and self-contradicting statements.

f. The second problem is that you identify the existing CI as “flawed”. This identifies that you are focused on the physical, an aspect of the Database that is outside your job description, that should be irrelevant to you; that is none of your business as Developer, as per [2.6] and Submission 2. This further determines that you do not yet qualify. You will need to make a statement that the code works regardless of the physical aspects of the Database, specifically that it relies on the Logical Data only, and that it is immune to the current Clustered Index and changes to it.

Heads Up
Typical of an ongoing development project, another user from another country (not the user you work for) has submitted a late requirement, the implementation of which will cause (i) the CustomerTransaction Table and (ii) its Clustered Index, to change. I will not waste your time with the details. Rest assured: as long as your code is standard-compliant, which will be declared as per [f], and verified when you submit it, it will be unaffected by such changes.
