Thursday, June 23, 2005

Database Learning

Database Normalization (also spelled database normalisation) relates to the level of redundancy in a database's structure. Well-normalized databases have a schema that reflects the true dependencies between tracked quantities. This means that updates can be quickly performed with little risk of data becoming inconsistent.
In the relational model, formal methods exist for quantifying "how normalized" a database is. These classifications are called normal forms (or NFs), and there are algorithms for converting a given database between them.
Any increase in normalization generally involves splitting existing tables into multiple ones, which must be joined back together whenever a query needs the combined data. This can sometimes lead to poor performance, so intentional denormalization is used for applications like on-line analytical processing.

Normal Forms
One can only ascribe a normal form to a database if relationships between quantities have been rigorously defined. It is possible to use set theory to express this knowledge once a problem domain has been fully understood. Such formality is usually sidestepped by database designers, who instead model the relationships in terms of an "idealized schema". (The mathematical underpinnings come back into play in proofs regarding the process of transforming from one form to another.)
Edgar F. Codd originally established three normal forms: 1NF, 2NF and 3NF. There are now others that are generally accepted, but 3NF is sufficient for most practical applications. Normalizing beyond that point rarely yields enough benefit to warrant the added complexity.

First normal form (or 1NF) requires that all column values in a table are atomic (e.g., a number is an atomic value, while a list or a set is not). This basic form eliminates repeating groups of attributes by moving each group into a separate table and connecting it to the original table with a primary key-foreign key relationship.
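As a rough sketch of what this looks like in practice, here is a small example using Python's built-in sqlite3 module; the table and column names (employee, employee_phone, phone_numbers) and the data are invented purely for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Not in 1NF: phone_numbers holds a comma-separated list, which is not atomic.
cur.execute("""CREATE TABLE employee_flat (
                   emp_id        INTEGER PRIMARY KEY,
                   name          TEXT,
                   phone_numbers TEXT)""")
cur.execute("INSERT INTO employee_flat VALUES (1, 'Ann', '555-1000, 555-2000')")

# 1NF: the repeating group moves into its own table, connected back to
# employee by a primary key-foreign key relationship.
cur.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE employee_phone (
                   emp_id INTEGER REFERENCES employee(emp_id),
                   phone  TEXT,
                   PRIMARY KEY (emp_id, phone))""")
cur.execute("INSERT INTO employee VALUES (1, 'Ann')")
cur.executemany("INSERT INTO employee_phone VALUES (?, ?)",
                [(1, '555-1000'), (1, '555-2000')])
conn.commit()
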
Second normal form (or 2NF) applies to relations with composite primary keys, where two or more attributes make up the primary key. It requires that there be no non-trivial functional dependency of a non-key attribute on a part (proper subset) of a candidate key. A relation is in 2NF if and only if it is in 1NF and every non-key attribute is irreducibly dependent on the primary key (i.e., not partially dependent on a candidate key).
To determine whether a relation is in 2NF, ask whether any non-key attribute of the table could be derived from a subset of the composite key rather than from the whole composite key.
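For instance, here is a tiny sketch of that test in Python. The order_items relation and its columns (order_id, product_id, product_name, qty) are hypothetical; its composite key is (order_id, product_id).

def determines(rows, lhs, rhs):
    # True if, in this sample data, every distinct value of the lhs columns
    # maps to exactly one value of the rhs column (a functional dependency).
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False
        seen[key] = row[rhs]
    return True

order_items = [
    {"order_id": 1, "product_id": 10, "product_name": "Widget", "qty": 2},
    {"order_id": 2, "product_id": 10, "product_name": "Widget", "qty": 5},
    {"order_id": 2, "product_id": 11, "product_name": "Gadget", "qty": 1},
]

# product_name is determined by product_id alone, a part of the key, so the
# relation is not in 2NF; product_name belongs in a separate product table.
print(determines(order_items, ("product_id",), "product_name"))    # True: partial dependency
print(determines(order_items, ("order_id", "product_id"), "qty"))  # True: full dependency, fine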

Third normal form (or 3NF) requires that there are no non-trivial functional dependencies of non-key attributes on anything other than a superset of a candidate key. A relation is in 3NF if no non-key attribute is a fact about any other non-key attribute. Another way of saying this is that all non-key attributes must be mutually independent (i.e., there should be no transitive dependencies).
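A minimal sketch of a 3NF decomposition, again with sqlite3 and made-up names: dept_name is a fact about dept_id, another non-key attribute (emp_id -> dept_id -> dept_name is a transitive dependency), so it moves into its own table.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Not in 3NF: dept_name depends on dept_id, not directly on emp_id.
cur.execute("""CREATE TABLE employee_flat (
                   emp_id    INTEGER PRIMARY KEY,
                   name      TEXT,
                   dept_id   INTEGER,
                   dept_name TEXT)""")

# 3NF: the transitive fact gets its own table.
cur.execute("""CREATE TABLE department (
                   dept_id   INTEGER PRIMARY KEY,
                   dept_name TEXT)""")
cur.execute("""CREATE TABLE employee (
                   emp_id  INTEGER PRIMARY KEY,
                   name    TEXT,
                   dept_id INTEGER REFERENCES department(dept_id))""")
conn.commit()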

Boyce-Codd normal form (or BCNF) requires that there are no non-trivial functional dependencies of attributes on anything other than a superset of a candidate key. At this stage, every attribute is dependent on a key, the whole key, and nothing but a key (excluding trivial dependencies, like A -> A). A relation is in BCNF if and only if it is in 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a relation is in BCNF if it is in 3NF and the only determinants are the candidate keys.
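Here is a small illustration using the classic teaching example (the names and data are hypothetical): each instructor teaches exactly one course, so instructor -> course holds, but "instructor" is not a candidate key. The relation is in 3NF (course is part of a candidate key) yet not in BCNF, and the usual fix is to decompose on the offending dependency.

# Tuples are (student, course, instructor).
# FDs: (student, course) -> instructor and instructor -> course.
# Candidate keys: (student, course) and (student, instructor).
teaches = [
    ("smith", "databases", "codd"),
    ("jones", "databases", "codd"),
    ("smith", "compilers", "aho"),
]

# BCNF decomposition: split on the offending dependency instructor -> course.
instructor_course = sorted({(inst, course) for (_, course, inst) in teaches})
student_instructor = sorted({(student, inst) for (student, _, inst) in teaches})
print(instructor_course)   # [('aho', 'compilers'), ('codd', 'databases')]
print(student_instructor)  # [('jones', 'codd'), ('smith', 'aho'), ('smith', 'codd')]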

Fourth normal form (or 4NF) requires that there are no non-trivial multi-valued dependencies of attribute sets on anything other than a superset of a candidate key. A relation is in 4NF if and only if it is in BCNF and its multi-valued dependencies are also functional dependencies. 4NF removes the remaining unwanted structure: multi-valued dependencies.
Ronald Fagin demonstrated that it is always possible to achieve 4NF (though not always desirable). Rissanen's theorem is also applicable to multi-valued dependencies.
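A sketch of the idea with hypothetical data: in the single relation below, skill and language are independent multi-valued facts about an employee (employee ->> skill and employee ->> language), so every combination has to be stored. Splitting into two tables removes the multi-valued dependencies.

# Tuples are (employee, skill, language); all four combinations must appear.
emp_skill_lang = [
    ("ann", "sql",  "english"),
    ("ann", "sql",  "french"),
    ("ann", "unix", "english"),
    ("ann", "unix", "french"),
]

# 4NF decomposition: one table per independent multi-valued fact.
emp_skill = sorted({(e, s) for (e, s, _) in emp_skill_lang})
emp_lang  = sorted({(e, l) for (e, _, l) in emp_skill_lang})
print(emp_skill)   # [('ann', 'sql'), ('ann', 'unix')]
print(emp_lang)    # [('ann', 'english'), ('ann', 'french')]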

Fifth normal form (or 5NF or PJ/NF) requires that there are no non-trivial join dependencies that do not follow from the key constraints. A relation R is in 5NF if and only if it is in 4NF and every join dependency in R is implied by its candidate keys.


A join in SQL combines records from two or more tables in a relational database. In the Structured Query Language (SQL), there are two types of joins: "inner" and "outer". Outer joins are further subdivided into left outer joins, right outer joins, and full outer joins.

Inner and outer joins
Inner join is the default join method if nothing else is specified. An inner join essentially finds the intersection of the two tables: it takes each record from table A (in this case, employee) and finds the matching record(s) in table B (department). If no match is found, the record from A is not included in the results. If multiple rows in B match the predicate, one result row is returned for each of them (with the values from A repeated). Special care must be taken when joining tables on columns that can be NULL, since NULL values never match each other. See left outer join or right outer join for a way to keep such unmatched rows.
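Here is a small runnable sketch with sqlite3; the employee and department tables and their rows are just sample data. Note how the employee with a NULL dept_id and the department with no employees both drop out of the inner join.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (name TEXT, dept_id INTEGER)")
cur.execute("CREATE TABLE department (dept_id INTEGER, dept_name TEXT)")
cur.executemany("INSERT INTO employee VALUES (?, ?)",
                [("Rafferty", 31), ("Jones", 33), ("Heisenberg", 33), ("John", None)])
cur.executemany("INSERT INTO department VALUES (?, ?)",
                [(31, "Sales"), (33, "Engineering"), (35, "Marketing")])

# Inner join: only rows whose dept_id matches in both tables are returned.
for row in cur.execute("""SELECT e.name, d.dept_name
                          FROM employee e
                          INNER JOIN department d ON e.dept_id = d.dept_id"""):
    print(row)   # ('Rafferty', 'Sales'), ('Jones', 'Engineering'), ('Heisenberg', 'Engineering')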

A left outer join is very different from an inner join. Instead of requiring a match in both tables, it returns every row from the "left" table (A). If the ON clause matches no records in B, a row is still returned for A, with NULL values for each column from B.

A right outer join is much like a left outer join, except that the roles of the tables are reversed. Every record from the right table (B, department) is returned, and NULL values are filled in for rows that have no matching record in A.

Full outer joins are the combination of left and right outer joins: they return records from both tables, filling in NULLs for missing matches on either side.

Some database systems do not offer full outer joins directly, but they can be emulated with left outer joins and a union, as sketched below.
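For example, one common emulation (sketched here with sqlite3 and hypothetical employee/department tables) runs a left outer join in each direction and combines the results with UNION, which also removes the rows that the two halves have in common.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (name TEXT, dept_id INTEGER)")
cur.execute("CREATE TABLE department (dept_id INTEGER, dept_name TEXT)")
cur.executemany("INSERT INTO employee VALUES (?, ?)", [("Rafferty", 31), ("John", None)])
cur.executemany("INSERT INTO department VALUES (?, ?)", [(31, "Sales"), (35, "Marketing")])

# Emulated full outer join: every employee and every department appears,
# with NULLs filled in where there is no match on the other side.
emulated_full_outer = """
    SELECT e.name, d.dept_name
    FROM employee e LEFT OUTER JOIN department d ON e.dept_id = d.dept_id
    UNION
    SELECT e.name, d.dept_name
    FROM department d LEFT OUTER JOIN employee e ON e.dept_id = d.dept_id
"""
for row in cur.execute(emulated_full_outer):
    print(row)   # ('Rafferty', 'Sales'), ('John', None), (None, 'Marketing') in some order

If genuinely duplicate rows matter, the second half is usually restricted to unmatched rows and combined with UNION ALL instead, since UNION would collapse the duplicates.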

Implementation
The efficient implementation of joins has been the goal of much work in database systems, because joins are both extremely common and rather difficult to execute efficiently. The difficulty stems from the fact that joins are commutative and associative, which gives the system a very large space of equivalent ways to evaluate the same query. In practice, this means that the user merely supplies the list of tables to be joined and the join conditions to be used, and the database system has the task of determining the most efficient way to perform the operation. Determining how to execute a query containing joins is done by the query optimizer. It has two basic freedoms:

join order: because joins are commutative and associative, the order in which tables are joined does not change the final result set of the query. However, join order can have an enormous impact on the cost of the join operation, so choosing the best order is very important.

join method: given two tables and a join condition, there are multiple algorithms to produce the result set of the join. Which algorithm is most efficient depends on the sizes of the input tables, the number of rows from each table that match the join condition, and the operations required by the rest of the query.
Many join algorithms treat their input tables differently. The input tables are referred to as the outer and inner tables, or left and right, respectively. In the case of nested loops, for example, the entire inner table will be scanned for each row of the outer table.
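Most systems will show the optimizer's choices if asked. As a small illustration (SQLite through Python's sqlite3; the tables are hypothetical and the exact output wording varies between versions):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (name TEXT, dept_id INTEGER)")
cur.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT)")

# EXPLAIN QUERY PLAN reports the join order and access method the optimizer
# picked (for example, a scan of one table with a primary-key lookup into
# the other) without actually running the query.
for row in cur.execute("""EXPLAIN QUERY PLAN
                          SELECT e.name, d.dept_name
                          FROM employee e JOIN department d ON e.dept_id = d.dept_id"""):
    print(row)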

Query plans involving joins can be classified as:

left-deep: the inner table of each join in the plan is a base table (rather than another join).
right-deep: the outer table of each join in the plan is a base table.
bushy: neither left-deep nor right-deep; both input tables of a join query may be joins themselves.
These names are derived from the appearance of the query plan when drawn as a tree, with the outer relation on the left and the inner relation on the right (as is the convention).
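A toy sketch of this classification in Python, modeling a plan as nested ("join", outer, inner) tuples with base tables as plain strings; real optimizers of course use richer plan structures.

def is_left_deep(plan):
    # every join's inner (right-hand) input is a base table
    if isinstance(plan, str):
        return True
    _, outer, inner = plan
    return isinstance(inner, str) and is_left_deep(outer)

def is_right_deep(plan):
    # every join's outer (left-hand) input is a base table
    if isinstance(plan, str):
        return True
    _, outer, inner = plan
    return isinstance(outer, str) and is_right_deep(inner)

left_deep  = ("join", ("join", "A", "B"), "C")                 # (A join B) join C
right_deep = ("join", "A", ("join", "B", "C"))                 # A join (B join C)
bushy      = ("join", ("join", "A", "B"), ("join", "C", "D"))  # neither

for name, plan in [("left-deep", left_deep), ("right-deep", right_deep), ("bushy", bushy)]:
    print(name, is_left_deep(plan), is_right_deep(plan))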

Join Optimization: See Query optimizer

Join algorithms
There are three fundamental algorithms for performing a join operation: nested loops join, merge join, and hash join.

Nested loops
This is the simplest join algorithm. For each tuple in the outer join relation, the entire inner join relation is scanned, and any tuples that match the join condition are added to the result set. Naturally, this algorithm performs poorly if either the inner or outer join relation is very large.
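A direct transcription of this algorithm in Python, with relations represented as lists of dicts and made-up column names:

def nested_loop_join(outer, inner, join_attr):
    result = []
    for o in outer:                 # for each tuple in the outer relation...
        for i in inner:             # ...scan the entire inner relation
            if o[join_attr] == i[join_attr]:
                result.append({**o, **i})
    return result

employees = [{"name": "Ann", "dept_id": 1}, {"name": "Bob", "dept_id": 2}]
departments = [{"dept_id": 1, "dept_name": "Sales"}]
print(nested_loop_join(employees, departments, "dept_id"))
# [{'name': 'Ann', 'dept_id': 1, 'dept_name': 'Sales'}]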

A refinement to this technique is called "block nested loops": for every block in the outer relation, the entire inner relation is scanned. For each match between the current inner tuple and one of the tuples in the current block of the outer relation, a tuple is added to the join result set. This variant means that more computation is done for each tuple of the inner relation, but far fewer scans of the inner relation are required.
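And a sketch of the block variant, where the outer relation is consumed in fixed-size blocks and the inner relation is scanned once per block rather than once per outer tuple (the block size and data are arbitrary):

def block_nested_loop_join(outer, inner, join_attr, block_size=2):
    result = []
    for start in range(0, len(outer), block_size):
        block = outer[start:start + block_size]   # current block of outer tuples
        for i in inner:                           # one scan of the inner relation per block
            for o in block:
                if o[join_attr] == i[join_attr]:
                    result.append({**o, **i})
    return result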

Merge join
If both join relations are sorted by the join attribute, the join can be performed trivially:

For each tuple in the outer relation, consider the current "group" of tuples from the inner relation; a group consists of a set of contiguous tuples in the inner relation with the same value in the join attribute.
For each matching tuple in the current inner group, add a tuple to the join result. Once the inner group has been exhausted, both the inner and outer scans can be advanced to the next group.
This is one reason why many optimizers keep track of the sort order of query nodes: if one or both inputs to a merge join are already sorted on the join attribute, an additional sort is not required. Otherwise, the DBMS will need to perform the sort, usually using an external sort to avoid consuming too much memory.
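A sketch of the merge step in Python, assuming both inputs are lists of dicts already sorted on the join attribute (the table and column names are made up):

def merge_join(outer, inner, join_attr):
    result = []
    i = 0
    for o in outer:
        # advance the inner scan past groups with smaller join values
        while i < len(inner) and inner[i][join_attr] < o[join_attr]:
            i += 1
        # the current inner "group": contiguous tuples with the same join value
        j = i
        while j < len(inner) and inner[j][join_attr] == o[join_attr]:
            result.append({**o, **inner[j]})
            j += 1
        # i stays at the start of the group, since the next outer tuple may
        # carry the same join value and must see the same group again
    return result

employees = [{"name": "Ann", "dept_id": 1}, {"name": "Bob", "dept_id": 1}]   # sorted by dept_id
departments = [{"dept_id": 1, "dept_name": "Sales"}, {"dept_id": 2, "dept_name": "Ops"}]
print(merge_join(employees, departments, "dept_id"))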

See also Sort-Merge Join and Hash Join.

A semi join is an efficient join method, used especially in distributed databases, in which the join attribute values of one table are first collected and sent to the site of the second table so that only potentially matching rows need to be processed further. It was first reported in 1981. It can be improved with a Bloom filter (hashing).
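A sketch of the idea in Python with hypothetical orders/customers tables: ship only the set of join attribute values, filter at the other site, and join what comes back. A Bloom filter would replace the exact set with a compact hash-based bitmap, at the cost of occasional false positives.

def semi_join(local_rows, remote_rows, join_attr):
    # Step 1: collect the join attribute values of the local table.
    keys = {row[join_attr] for row in local_rows}
    # Step 2: at the remote site, keep only rows whose join value appears in
    # the shipped set; only these need to be sent back and joined for real.
    return [row for row in remote_rows if row[join_attr] in keys]

orders = [{"order_id": 7, "cust_id": 1}, {"order_id": 8, "cust_id": 3}]
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
print(semi_join(orders, customers, "cust_id"))   # only Ann can participate in the join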

New Software Product


Protecting Our Workforce from Themselves One Document at a Time

Wednesday, June 22, 2005

Engineer vs. Manager


There was a guy lost in a hot air balloon.
He asked a guy on the ground if he could tell him where he was.
The guy responded, "You're fifty feet in the air in a hot air balloon."

The man in the hot air balloon responded, "You must be an engineer."
The man on the ground responded, "That's right, how did you know?"
The man in the balloon said he knew because, "Your answer is correct, but it's totally useless."

The man on the ground responded, "You must be a manager."
The man in the balloon responded, "You're right, how did you know?"
The man on the ground said that he knew because, "You're screwed and suddenly it's all my fault."

from Christopher, who added: this is a joke that I heard already today (it's 9:55 AM). I hope you liked it. I thought of you because you seem to be on both sides of the engineer/manager battle that I gather is ongoing. I'll talk to you soon. Love, C

Monday, June 20, 2005

Boulder, CO Vacation with Pat

Wednesday, 5/25/05: We were both awake at 4am and eventually departed Bradley 7am, arrived 12 noon in Denver, rented a car, drove to Boulder on the 470 toll highway - $1.75 every 5 miles - and checked into the Pearl Street Inn. Found out from the owner, Kate (originally from Glens Falls, NY - a Prynn of the Finch & Prynn Paper Mill), that we will be the last customers since after 7?? years of B&B, they are closing. Met the son, Jayson. The existing guest rooms will be converted to 4 beautiful luxury town homes. For more information call Kate Beeman at 303.581.0548 or Jayson T Cote at 303.910.8020

After that surprise we unpacked and desperately needed a walk. We wandered around the Pearl Street outdoor mall for the rest of the day and into the night, eating and drinking and looking in shops. A great question to ask the helpful people of Boulder - If today were your day off, what would be the hike you would take?

Thursday, 5/26: After a nice breakfast and a stop at the Roosevelt National Forest Boulder Ranger District for an excellent map and directions, we drove into the Rocky Mountain National Park up to the highest point on the highway where, due to the altitude, "elevation" spoken comes out "evalation." Good thing we didn't bring a bottle of wine for a picnic. There was over 8' of snow still up there. When we got back to Boulder late that afternoon, after a coffee stop in Lyons, we walked along the river up to the start of the hills at 6th Street, walked all the way back on Pearl Street, and ate at a nice Italian place in the outdoor mall.

Friday, 5/27: back to the park for a hike

Saturday, 5/28: After arrangements with my sister Lisa, we watched family eat breakfast at an IHOP then all drove to the Gold Mine in Idaho Springs which was really excellent. After lunch at Beau Jos for pizza and an aborted tour of the next ghost town, we split up and took David with us for a hike to St. Mary's Glacier where we saw climbers and snowboarders enjoying the steep slopes. I found a small quantity of glacial herb on the way down that found its way to Jerry who was very pleased.

Sunday, 5/29: the Flatirons. The Wedding.

Monday, 5/30: We walked really fast in the BolderBoulder 10K Race and got some great snacks at the end of the race followed by a $1/minute massage. Pat got an extra 5 minutes since she started before me and I didn't want her to have to wait around too long for mine to finish. They left just enough oil on and pointed out the sore spots so when we met Bob and Linda Schwall in CU stadium, there was some self massage going on. Big nap. Dinner at Dave Rudman's house.

Tuesday, 5/31: One more big hike. It's good to know what you really like so that you aren't then confused by all of the other choices.

Wednesday, 6/1: last breakfast @ B&B. Fly back to CT.