Search This Blog

Wednesday 5 September 2012

Data Warehousing Terminologies and Acronyms

We know several data warehousing terms, here are some important Data warehousing terminologies and concepts that would give fair idea on DW Concepts

* DSS - which stands for Decision Support System
* OLAP - On-line Analytical Processing

* DM which stands for Data Mart

* Dimension – A single set of data about an item described in a fact table, a dimension is usually a denormalized table. A dimension table holds a key value and a numerical measurement or set of related measurements about the fact table object. A measurement is usually a sum but could also be an average, a mean or a variance. A dimension can have many attributes, 50 or more is the norm, since they are denormalized structures.

* Aggregate, aggregation – This refers to the process by which data is summarized over specific periods.
However, there are many more terms that you will need to be familiar with when discussing a data warehouse. Let's look at these before we go on to more advanced topics.

* Bitmap – A special form of index that equates values to bits and then stores the bits in an index. Usually smaller and faster to search than a b*tree

* Clean and Scrub – The process by which data is made ready for insertion into a data warehouse

* Cluster – A data structure in Oracle that stores the cluster key values from several tables in the same physical blocks. This makes retrieval of data from the tables much faster.

* Cluster (2) – A set of machines usually tied together with a high speed interconnect and sharing disk resources

* CUBE – CUBE enables a SELECT statement to calculate subtotals for all possible combinations of a group of dimensions. It also calculates a grand total. This is the set of information typically needed for all cross-tabular reports, so CUBE can calculate a cross-tabular report with a single SELECT statement. Like ROLLUP, CUBE is a simple extension to the GROUP BY clause, and its syntax is also easy to learn.

* Data Mining – The process of discovering data relationships that were previously unknown.

* Data Refresh – The process by which all or part of the data in the warehouse is replaced.

* Data Synchronization – Keeping data in the warehouse synchronized with source data.

* Derived data – Data that isn't sourced, but rather is derived from sourced data such as rollups or cubes

* Dimensional data warehouse – A data warehouse that makes use of the star and snowflake schema design using fact tables and dimension tables.

* Drill down – The process by which more and more detailed information is revealed

* Fact table – The central table of a star or snowflake schema. Usually the fact table is the collection of the key values from the dimension tables and the base facts of the table subject. A fact table is usually normalized.

* Granularity – This defines the level of aggregation in the data warehouse. To fine a level and your users have to do repeated additional aggregation, to course a level and the data becomes meaningless for most users.

* Legacy data – Data that is historical in nature and is usually stored offline

* MPP – Massively parallel processing – Description of a computer with many CPUs , spreads the work over many processors.

* Middleware – Software that makes the interchange of data between users and databases easier

* Mission Critical – A system that if it fails effects the viability of the company

* Parallel query – A process by which a query is broken into multiple subsets to speed execution

* Partition – The process by which a large table or index is split into multiple extents on multiple storage areas to speed processing.

* ROA – Return on Assets

* ROI – Return on investment

* Roll-up – Higher levels of aggregation

* ROLLUP -- ROLLUP enables a SELECT statement to calculate multiple levels of subtotals across a specified group of dimensions. It also calculates a grand total. ROLLUP is a simple extension to the GROUP BY clause, so its syntax is extremely easy to use. The ROLLUP extension is highly efficient, adding minimal overhead to a query.

* Snowflake – A type of data warehouse structure which uses the star structure as a base and then normalizes the associated dimension tables.

* Sparse matrix – A data structure where every intersection is not filled

* Stamp – Can be either a time stamp or a source stamp identifying when data was created or where it came from.

* Standardize – The process by which data from several sources is made to be the same.

* Star- A layout method for a schema in a data warehouse

* Summarization – The process by which data is summarized to present to DSS or DWH users.
Data Warehouse Storage Structures

Data warehouses have several basic storage structures. The structure of a warehouse will depend on how it is to be used. If a data warehouse will be used primarily for rollup and cube type operations it should be in the OLAP structure using fact and dimension tables. If a DWH is primarily used for reviewing trends, looking at standard reports and data screens then a DSS framework of denormalized tables should be used.

Unfortunately many DWH projects attempt to make one structure fit all requirements when in fact many DWH projects should use a synthesis of multiple structures including OLTP, OLAP and DSS.

Many data warehouse projects use STAR and SNOWFLAKE schema designs for their basic layout. These layouts use the "FACT table -- Dimension tables" layout with the SNOWFLAKE having dimension tables that are also FACT tables.

Data warehouses consume a great deal of disk resources. Make sure you increase controllers as you increase disks to prevent IO channel saturation. Spread Oracle DWHs across as many disk resources as possible, especially with partitioned tables and indexes. Avoid RAID5 even though it offers great reliability it is difficult if not impossible to accurately determine file placement. The exception may be with vendors such as EMC that provide high speed anticipatory caching.



What is Oracle Database Link and How to create database link (DB Link)

Recent Posts