Thursday, June 23, 2005

Database Learning

Database Normalization (also spelled database normalisation) relates to the level of redundancy in a database's structure. Well-normalized databases have a schema that reflects the true dependencies between tracked quantities. This means that updates can be quickly performed with little risk of data becoming inconsistent.
In the relational model, formal methods exist for quantifying "how normalized" a database is. These classifications are called normal forms (or NFs), and there are algorithms for converting a given database between them.
Any increase in normalization generally involves splitting existing tables into multiple ones, which must be re-joined each time a query is issued. This can sometimes lead to poor performance, so intentional denormalization is used for applications like on-line analytical processing.

Normal Forms
One can only ascribe a normal form to a database if relationships between quantities have been rigorously defined. It is possible to use set theory to express this knowledge once a problem domain has been fully understood. Such formality is usually sidestepped by database designers, who instead model the relationships in terms of an "idealized schema". (The mathematical underpinnings come back into play in proofs regarding the process of transforming from one form to another.)
Edgar F. Codd originally established three normal forms: 1NF, 2NF and 3NF. There are now others that are generally accepted, but 3NF is sufficient for most practical applications. Normalizing beyond that point rarely yields enough benefit to warrant the added complexity.

First normal form (or 1NF) requires that all column values in a table are atomic (e.g., a number is an atomic value, while a list or a set is not). This basic form eliminates repeating groups of attributes by moving each group into a separate table and connecting it to the original table with a primary key-foreign key relationship.
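A minimal sketch of such a split, with table and column names invented here for illustration: a column holding a whole list of phone numbers violates 1NF, so each number becomes its own row in a child table.

-- Not in 1NF: one column stores a whole list of phone numbers.
-- CREATE TABLE Customer (CustomerID INT PRIMARY KEY, Name VARCHAR(50), PhoneNumbers VARCHAR(200));

-- 1NF: each phone number becomes one row in a child table,
-- linked back to its customer by a primary key-foreign key relationship.
CREATE TABLE Customer (
    CustomerID INT PRIMARY KEY,
    Name       VARCHAR(50)
);

CREATE TABLE CustomerPhone (
    CustomerID  INT REFERENCES Customer(CustomerID),
    PhoneNumber VARCHAR(20),
    PRIMARY KEY (CustomerID, PhoneNumber)
);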
Second normal form (or 2NF) applies to relations with composite primary keys, where two or more attributes make up the primary key. It requires that there are no non-trivial functional dependencies of a non-key attribute on a part (proper subset) of a candidate key. A relation is in 2NF if and only if it is in 1NF and every non-key attribute is irreducibly dependent on the primary key (i.e., not partially dependent on a candidate key).
To check whether a relation is in 2NF, ask whether any of the non-key attributes of the table could be derived from a subset of the composite key rather than from the whole composite key.
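A rough illustration, with all names invented here: suppose an order-line table keyed by (OrderID, ProductID) also stores the product's name. The name depends on ProductID alone, which is a partial dependency.

-- Violates 2NF: ProductName depends only on ProductID, which is just
-- part of the composite key (OrderID, ProductID).
-- CREATE TABLE OrderLine (OrderID INT, ProductID INT, ProductName VARCHAR(50),
--                         Quantity INT, PRIMARY KEY (OrderID, ProductID));

-- 2NF: the partially dependent attribute moves to its own table.
CREATE TABLE Product (
    ProductID   INT PRIMARY KEY,
    ProductName VARCHAR(50)
);

CREATE TABLE OrderLine (
    OrderID   INT,
    ProductID INT REFERENCES Product(ProductID),
    Quantity  INT,
    PRIMARY KEY (OrderID, ProductID)
);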

Third normal form (or 3NF) requires that there are no non-trivial functional dependencies of non-key attributes on anything other than a superset of a candidate key. A relation is in 3NF if none of the non-key attributes is a fact about any other non-key attribute. Another way of saying this is that all non-key attributes are mutually independent (i.e. there should not be transitive dependencies).
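A sketch of a transitive dependency and its removal, using hypothetical employee and department tables (the same names reappear in the join examples further down):

-- Violates 3NF: DepartmentName is a fact about DepartmentID, another
-- non-key attribute, so it depends on EmployeeID only transitively.
-- CREATE TABLE employee (EmployeeID INT PRIMARY KEY, LastName VARCHAR(50),
--                        DepartmentID INT, DepartmentName VARCHAR(50));

-- 3NF: the transitively dependent attribute moves into department.
CREATE TABLE department (
    DepartmentID   INT PRIMARY KEY,
    DepartmentName VARCHAR(50)
);

CREATE TABLE employee (
    EmployeeID   INT PRIMARY KEY,
    LastName     VARCHAR(50),
    DepartmentID INT REFERENCES department(DepartmentID)
);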

Boyce-Codd normal form (or BCNF) requires that there are no non-trivial functional dependencies of attributes on anything other than a superset of a candidate key. At this stage, all attributes are dependent on a key, a whole key and nothing but a key (excluding trivial dependencies, like A->A). A relation is in BCNF if and only if it is in 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a relation is in BCNF if it is in 3NF and the only determinants are the candidate keys.
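A standard textbook-style illustration, with names invented here: a tutoring table in which each teacher teaches exactly one course. The dependency Teacher -> Course has a determinant that is not a candidate key, so the table is in 3NF but not in BCNF.

-- Candidate keys: (StudentID, Course) and (StudentID, Teacher).
-- Teacher -> Course holds, but Teacher is not a candidate key.
CREATE TABLE Tutoring (
    StudentID INT,
    Course    VARCHAR(50),
    Teacher   VARCHAR(50),
    PRIMARY KEY (StudentID, Course)
);

-- BCNF decomposition: every remaining determinant is a candidate key.
CREATE TABLE TeacherCourse (
    Teacher VARCHAR(50) PRIMARY KEY,
    Course  VARCHAR(50)
);

CREATE TABLE StudentTeacher (
    StudentID INT,
    Teacher   VARCHAR(50) REFERENCES TeacherCourse(Teacher),
    PRIMARY KEY (StudentID, Teacher)
);

The decomposition is lossless, though the dependency (StudentID, Course) -> Teacher can no longer be enforced by keys alone, a well-known trade-off of going to BCNF.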

Fourth normal form (or 4NF) requires that there are no non-trivial multi-valued dependencies of attribute sets on anything other than a superset of a candidate key. A relation is in 4NF if and only if it is in BCNF and all of its multi-valued dependencies are also functional dependencies. 4NF removes one kind of unwanted structure: multi-valued dependencies.
Ronald Fagin demonstrated that it is always possible to achieve 4NF (but not always desirable). Rissanen's theorem is also applicable to multi-valued dependencies.
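A rough illustration with invented names: if an employee's skills and spoken languages are independent facts, storing them in one table forces every combination of skill and language to be recorded.

-- EmployeeID ->> Skill and EmployeeID ->> Language are multi-valued
-- dependencies that are not functional dependencies, so this all-key
-- table is in BCNF but not in 4NF.
-- CREATE TABLE EmployeeSkillLanguage (EmployeeID INT, Skill VARCHAR(50),
--     Language VARCHAR(50), PRIMARY KEY (EmployeeID, Skill, Language));

-- 4NF: each independent multi-valued fact gets its own table.
CREATE TABLE EmployeeSkill (
    EmployeeID INT,
    Skill      VARCHAR(50),
    PRIMARY KEY (EmployeeID, Skill)
);

CREATE TABLE EmployeeLanguage (
    EmployeeID INT,
    Language   VARCHAR(50),
    PRIMARY KEY (EmployeeID, Language)
);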

Fifth normal form (or 5NF or PJ/NF) requires that there are no non-trivial join dependencies that do not follow from the key constraints. A relation R is in 5NF if and only if it is in 4NF and every join dependency in R is implied by its candidate keys.


Join (SQL) combines records from two or more tables in a relational database. In the Structured Query Language (SQL), there are two types of joins: "inner" and "outer". Outer joins are subdivided further into left outer joins, right outer joins, and full outer joins.

Inner and outer joins
Inner join is the default join method if nothing else is specified. An inner join essentially finds the intersection between the two tables. The example below takes all the records from table A (in this case, employee) and finds the matching record(s) from table B (department). If no match is found, the record from A is not included in the results. If multiple rows in B match the predicate, one result row is returned for each of them (with the values from A repeated). Special care must be taken when joining tables on columns that can be NULL, since NULL values will never match each other. See Left Outer Join or Right Outer Join for a solution.
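A minimal sketch of such a query; the employee.DepartmentID column referencing department is an assumption made for illustration (matching the hypothetical tables in the 3NF example above):

SELECT employee.LastName, department.DepartmentName
FROM employee
INNER JOIN department
    ON employee.DepartmentID = department.DepartmentID;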

A left outer join is very different from an inner join. Instead of limiting results to rows that match in both tables, it returns every row from the "left" table (A). This means that if the ON clause matches 0 records in B, a row will still appear in the result, but with NULL values for each column from B.
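The same hypothetical employee/department query as a left outer join; employees with no matching department still appear, with a NULL DepartmentName:

SELECT employee.LastName, department.DepartmentName
FROM employee
LEFT OUTER JOIN department
    ON employee.DepartmentID = department.DepartmentID;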

A right outer join is much like a left outer join, except that the tables are reversed. Every record from the right side, B or department, will be returned, and NULL values will be returned for those that have no matching record in A.
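And the mirror image as a right outer join, which keeps every department instead; LastName is NULL for departments with no employees:

SELECT employee.LastName, department.DepartmentName
FROM employee
RIGHT OUTER JOIN department
    ON employee.DepartmentID = department.DepartmentID;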

Full outer joins are the combination of left and right outer joins. These joins will show records from both tables, and fill in NULLs for missing matches on either side.

Some database systems do not offer this functionality, but it can be emulated through the use of left outer joins and unions.
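One hedged sketch of that emulation, with the same hypothetical tables: take a left outer join in one direction, then union in the rows from the other side that found no match.

SELECT employee.LastName, department.DepartmentName
FROM employee
LEFT OUTER JOIN department
    ON employee.DepartmentID = department.DepartmentID
UNION ALL
SELECT employee.LastName, department.DepartmentName
FROM department
LEFT OUTER JOIN employee
    ON employee.DepartmentID = department.DepartmentID
WHERE employee.DepartmentID IS NULL;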

Implementation
The efficient implementation of joins has been the goal of much work in database systems, because joins are both extremely common and rather difficult to execute efficiently. The difficulty results from the fact that joins are both commutative and associative. In practice, this means that the user merely supplies the list of tables to be joined and the join conditions to be used, and the database system has the task of determining the most efficient way to perform the operation. Determining how to execute a query containing joins is done by the query optimizer. It has two basic freedoms:

join order: because joins are commutative and associative, the order in which tables are joined does not change the final result set of the query. However, join order does have an enormous impact on the cost of the join operation, so choosing the right join order is very important.

join method: given two tables and a join condition, there are multiple algorithms to produce the result set of the join. Which algorithm is most efficient depends on the sizes of the input tables, the number of rows from each table that match the join condition, and the operations required by the rest of the query.
Many join algorithms treat their input tables differently. The input tables are referred to as the outer and inner tables, or left and right, respectively. In the case of nested loops, for example, the entire inner table will be scanned for each row of the outer table.

Query plans involving joins can be classified as:

left-deep: the inner table of each join in the plan is a base table (rather than another join).
right-deep: the outer table of each join in the plan is a base table.
bushy: neither left-deep nor right-deep; both input tables of a join may themselves be joins.
These names are derived from the appearance of the query if drawn as a tree, with the outer join relation on the left and the inner table on the right (as is the convention).

Join Optimization: See Query optimizer

Join algorithms
There are three fundamental algorithms to perform a join operation.

Nested loops
This is the simplest join algorithm. For each tuple in the outer join relation, the entire inner join relation is scanned, and any tuples that match the join condition are added to the result set. Naturally, this algorithm performs poorly if either the inner or outer join relation is very large.

A refinement to this technique is called "block nested loops": for every block in the outer relation, the entire inner relation is scanned. For each match between the current inner tuple and one of the tuples in the current block of the outer relation, a tuple is added to the join result set. This variant means that more computation is done for each tuple of the inner relation, but far fewer scans of the inner relation are required.

Merge join
If both join relations are sorted by the join attribute, the join can be performed trivially:

For each tuple in the outer relation, consider the current "group" of tuples from the inner relation; a group consists of a set of contiguous tuples in the inner relation with the same value in the join attribute.
For each matching tuple in the current inner group, add a tuple to the join result. Once the inner group has been exhausted, both the inner and outer scans can be advanced to the next group.
This is one reason why many optimizers keep track of the sort order of query nodes — if one or both input relations to a merge join is already sorted on the join attribute, an additional sort is not required. Otherwise, the DBMS will need to perform the sort, usually using an external sort to avoid consuming too much memory.

See also Sort-Merge Join
See Hash Join

A semi-join is an efficient join method in which the join attribute values of one table are first collected and sent to the second table, so that only the matching rows need to be processed. It was first reported in 1981. It can be improved with a Bloom filter (hashing).
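The sentence above describes the distributed form of the technique. In plain SQL, the semi-join effect, returning each qualifying row once no matter how many rows match, can be sketched with EXISTS (same hypothetical tables as before):

-- Departments that have at least one employee; each appears once.
SELECT department.DepartmentName
FROM department
WHERE EXISTS (
    SELECT 1
    FROM employee
    WHERE employee.DepartmentID = department.DepartmentID
);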
