What's A Data Warehouse?
Answer1:
A Data warehouse is a repository of
integrated information, available for queries and analysis. Data and
information are extracted from heterogeneous sources as they are generated.
This makes it much easier and more efficient to run queries over data that
originally came from different sources".
Another definition for data
warehouse is: " A data warehouse is a logical collection of information
gathered from many different operational databases used to create business intelligence
that supports business analysis activities and decision-making tasks,
primarily, a record of an enterprise's past transactional and operational
information, stored in a database designed to favor efficient data analysis and
reporting (especially OLAP)".
Generally, data warehousing is not meant for
current "live" data, although 'virtual' or 'point-to-point' data
warehouses can access operational data. A 'real' data warehouse is generally
preferred to a virtual DW because stored data has been validated and is set up
to provide reliable results to common types of queries used in a business.
Answer2:
Data Warehouse is a repository of integrated information, available for queries
and analysis. Data and information are extracted from heterogeneous sources as
they are generated. This makes it much easier and more efficient to run queries
over data that originally came from different sources.
Typical relational databases are designed for on-line transactional processing
(OLTP) and do not meet the requirements for effective on-line analytical
processing (OLAP). As a result, data warehouses are designed differently than
traditional relational databases.
What is ODS?
1. ODS means Operational Data Store.
2. A collection of operation or bases data that is extracted from operation
databases and standardized, cleansed, consolidated, transformed, and loaded
into enterprise data architecture. An ODS is used to support data mining of
operational data, or as the store for base data that is summarized for a data
warehouse.
The ODS may also be used to audit the data warehouse to assure
summarized and derived data is calculated properly. The ODS may further become
the enterprise shared operational database, allowing operational systems that
are being reengineered to use the ODS as there operation databases.
What is a dimension table?
A dimensional table is a collection of hierarchies and categories along
which the user can drill down and drill up. It contains only the textual
attributes.
What is a lookup table?
A lookup table is the one which is used when updating a warehouse. When the
lookup is placed on the target table (fact table / warehouse) based upon the
primary key of the target, it just updates the table by allowing only new
records or updated records based on the lookup condition.
Why should you put your data warehouse on a different system than your
OLTP system?
Answer1:
An OLTP system is basically “data oriented” (ER model) and not “Subject
oriented" (Dimensional Model). That is why we design a separate system
that will have a subject oriented OLAP system.
Moreover if a complex query is fired on an OLTP system will cause a heavy
overhead on the OLTP server that will affect the day today business directly.
Answer2:
The loading of a warehouse will likely consume a lot of machine resources.
Additionally, users may create queries or reports that are very resource
intensive because of the potentially large amount of data available. Such loads
and resource needs will conflict with the needs of the OLTP systems for
resources and will negatively impact those production systems.
What are Aggregate tables?
Aggregate table contains the summary of existing warehouse data which is
grouped to certain levels of dimensions. Retrieving the required data from the
actual table, which have millions of records will take more time and also
affects the server performance. To avoid this we can aggregate the table to
certain required level and can use it. This tables reduces the load in the
database server and increases the performance of the query and can retrieve the
result very fastly.
What is Dimensional Modeling? Why is it important?
Dimensional Modeling is a design concept used by many data warehouse designers
to build their datawarehosue. In this design model all the data is stored in
two types of tables - Facts table and Dimension table. Fact table contains the
facts/measurements of the business and the dimension table contains the context
of measurements i.e., the dimensions on which the facts are calculated.
Why is Data Modeling Important?
Data modeling is probably the most labor intensive and time consuming part
of the development process. Why bother especially if you are pressed for time?
A common response by practitioners who write on the subject is that you should
no more build a database without a model than you should build a house without
blueprints.
The goal of the data model is to make sure that the all data objects required
by the database are completely and accurately represented. Because the data
model uses easily understood notations and natural language , it can be
reviewed and verified as correct by the end-users.
The data model is also detailed enough to be used by the database developers to
use as a "blueprint" for building the physical database. The
information contained in the data model will be used to define the relational
tables, primary and foreign keys, stored procedures, and triggers. A poorly
designed database will require more time in the long-term. Without careful
planning you may create a database that omits data required to create critical
reports, produces results that are incorrect or inconsistent, and is unable to
accommodate changes in the user's requirements.
What is data mining?
Data mining is a process of extracting hidden trends within a datawarehosue.
For example an insurance data warehouse can be used to mine data for the most
high risk people to insure in a certain geographical area.
What is ETL?
ETL stands for extraction, transformation and loading.
ETL provide developers with an interface for designing source-to-target
mappings, transformation and job control parameter.
· Extraction
Take data from an external source and move it to the warehouse pre-processor
database.
· Transformation
Transform data task allows point-to-point generating, modifying and
transforming data.
· Loading
Load data task adds records to a database table in a warehouse.
What does level of Granularity of a fact table signify?
What does level of Granularity of a fact table signify?
Granularity
The first step in designing a fact table is to determine the granularity of the
fact table. By granularity, we mean the lowest level of information that will
be stored in the fact table. This constitutes two steps:
Determine which dimensions will be included.
Determine where along the hierarchy of each dimension the information will be
kept.
The determining factor usually goes back to the requirements
Difference between
OLTP and OLAP
OLTP
|
OLAP
|
|
User and System Orientation
|
Customer-oriented,
used for data analysis and querying by clerks, clients and IT professionals.
|
market-oriented,
used for data analysis by knowledge workers( managers, executives, analysis)
|
Data Contents
|
Manages current
data, very detail-oriented.
|
Manages large
amounts of historical data, provides facilities for summarization and
aggregation, stores information at different levels of granularity to support
decision making process.
|
Database Design
|
Adopts an entity
relationship(ER) model and an application-oriented database design.
|
Adopts star,
snowflake or fact constellation model and a subject-oriented database design.
|
View
|
Focuses on the
current data within an enterprise or department.
|
spans multiple
versions of a database schema due to the evolutionary process of an organization;
integrates information from many organizational locations and data stores
|
What are SCD1, SCD2, and SCD3?
SCD stands for slowly changing dimensions.
SCD1: only maintained updated values.
Ex: a customer address modified we update existing record with new address.
SCD2: maintaining historical information and current information by using
A) Effective Date
B) Versions
C) Flags
or combination of these
SCD3: by adding new columns to target table we maintain historical information
and current information.
Why are OLTP database designs not generally a good idea for a Data
Warehouse?
Since in OLTP, tables are normalized and hence query response will be slow
for end user and OLTP doesn’t contain years of data and hence cannot be analyzed.
What is BUS Schema?
BUS Schema is composed of a master suite of confirmed dimension and
standardized definition if facts.
What are the various Reporting tools in the Market?
1. MS-Excel
2. Business Objects (Crystal
Reports)
3. Cognos (Impromptu, Power Play)
4. Microstrategy
5. MS reporting services
6. Informatica Power Analyzer
7. Actuate
8. Hyperion (BRIO)
9. Oracle Express OLAP
10. Proclarity
What are Normalization, First Normal Form, Second Normal Form, And Third Normal Form?
1.Normalization is process for assigning attributes to entities–Reduces data
redundancies–Helps eliminate data anomalies–Produces controlled redundancies to
link tables
2.Normalization is the analysis of functional dependency between attributes /
data items of user views? It reduces a complex user view to a set of small ands
table subgroups of fields / relations
1NF:Repeating groups must be eliminated, Dependencies can be identified, All
key attributes defined, No repeating groups in table
2NF: The Table is already in1NF,Includes no partial dependencies–No attribute
dependent on a portion of primary key, Still possible to exhibit transitive dependency,
Attributes may be functionally dependent on non-key attributes
3NF: The Table is already in 2NF, Contains no transitive dependencies
What is Fact table?
Fact Table contains the measurements or metrics or facts of business
process. If your business process is "Sales" , then a measurement of
this business process such as "monthly sales number" is captured in
the Fact table. Fact table also contains the foreign keys for the dimension
tables.
What are conformed dimensions?
Answer1:
Conformed dimensions mean the exact same thing with every possible fact table
to which they are joined Ex: Date Dimensions is connected all facts like Sales facts,
Inventory facts..etc
Answer2:
Conformed dimensions are dimensions which are common to the cubes.(cubes are
the schemas contains facts and dimension tables)
Consider Cube-1 contains F1,D1,D2,D3 and Cube-2 contains F2,D1,D2,D4 are the
Facts and Dimensions here D1,D2 are the Conformed Dimensions
What are the Different methods of loading Dimension tables?
Conventional Load:
Before loading the data, all the Table constraints will be checked against the
data.
Direct load :( Faster Loading)
All the Constraints will be disabled. Data will be loaded directly. Later the
data will be checked against the table constraints and the bad data won't be
indexed.
What is conformed fact?
Conformed dimensions are the dimensions which can be used across multiple
Data Marts in combination with multiple facts tables accordingly
What are Data Marts?
Data Marts are designed to help manager make strategic decisions about their
business.
Data Marts are subset of the corporate-wide data that is of value to a specific
group of users.
There are two types of Data Marts:
1.Independent data marts – sources from data captured form OLTP system,
external providers or from data generated locally within a particular
department or geographic area.
2. Dependent data mart – sources directly form enterprise data warehouses.
What is a level of Granularity of a fact table?
Level of granularity means level of detail that you put into the fact table
in a data warehouse. For example: Based on design you can decide to put the
sales data in each transaction. Now, level of granularity would mean what
detail are you willing to put for each transactional fact. Product sales with
respect to each minute or you want to aggregate it up to minute and put that
data.
How are the Dimension tables designed?
Most dimension tables are designed using Normalization principles upto 2NF.
In some instances they are further normalized to 3NF.
Find where data for this dimension are located.
Figure out how to extract this data.
Determine how to maintain changes to this dimension (see more on this in the
next section).
What are non-additive facts?
Non-Additive: Non-additive facts are facts that cannot be summed up for any
of the dimensions present in the fact table.
What type of Indexing mechanism do we need to use for a typical
datawarehouse?
On the fact table it is best to use bitmap indexes. Dimension tables can use
bitmap and/or the other types of clustered/non-clustered, unique/non-unique
indexes.
To my knowledge, SQLServer does not support bitmap indexes. Only Oracle
supports bitmaps.
What Snow Flake Schema?
Snowflake Schema, each dimension has a primary dimension table, to which one
or more additional dimensions can join. The primary dimension table is the only
table that can join to the fact table.
What is real time data-warehousing?
Real-time data warehousing is a combination of two things: 1) real-time
activity and 2) data warehousing. Real-time activity is activity that is
happening right now. The activity could be anything such as the sale of
widgets. Once the activity is complete, there is data about it.
Data warehousing captures business activity data. Real-time data warehousing
captures business activity data as it occurs. As soon as the business activity
is complete and there is data about it, the completed activity data flows into
the data warehouse and becomes available instantly. In other words, real-time
data warehousing is a framework for deriving information from data as the data
becomes available.
What are slowly changing dimensions?
SCD stands for slowly changing dimensions. Slowly changing dimensions are of
three types
SCD1: only maintained updated values.
Ex: a customer address modified we update existing record with new address.
SCD2: maintaining historical information and current information by using
A) Effective Date
B) Versions
C) Flags
or combination of these
scd3: by adding new columns to target table we maintain historical information
and current information
What are Semi-additive and factless facts and in which scenario will you
use such kinds of fact tables?
Snapshot facts are semi-additive, while we maintain aggregated facts we go
for semi-additive.
EX: Average daily balance
a fact table without numeric fact columns is called factless fact table.
Ex: Promotion Facts
While maintain the promotion values of the transaction (ex: product samples)
because this table doesn’t contain any measures.
Differences between star and snowflake schemas?
Star schema - all dimensions will be linked directly with a fat table.
Snow schema - dimensions maybe interlinked or may have one-to-many relationship
with other tables.
What is a Star Schema?
Star schema is a type of organizing the tables such that we can retrieve the
result from the database easily and fastly in the warehouse environment.
Usually a star schema consists of one or more dimension tables around a fact
table which looks like a star, so that it got its name.
What is a general purpose scheduling tool?
The basic purpose of the scheduling tool in a DW Application is to stream
line the flow of data from Source to Target at specific time or based on some
condition.
Cheers!
Bose
Bose
Thanks for your information. very good article.
ReplyDeleteMicro Strategy Training
Online Micro Strategy Training