Search This Blog

Wednesday 5 September 2012

Data Warehousing Interview Questions and Answers Part2

What is ER Diagram?

The Entity-Relationship (ER) model was originally proposed by Peter in 1976 [Chen76] as a way to unify the network and relational database views. 


Simply stated the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram which is used to visually represent data objects. 

Since Chen wrote his paper the model has been extended and today it is commonly used for database design for the database designer, the utility of the ER model is: 

it maps well to the relational model. The constructs used in the ER model can easily be transformed into relational tables. it is simple and easy to understand with a minimum of training. Therefore, the model can be used by the database designer to communicate the design to the end user. 

In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.


Which columns go to the fact table and which columns go the dimension table?

The Primary Key columns of the Tables (Entities) go to the Dimension Tables as Foreign Keys. 

The Primary Key columns of the Dimension Tables go to the Fact Tables as Foreign Keys.


Here are a number of data modeling tools 


Tool Name Company Name
Erwin Computer Associates
Embarcadero Embarcadero Technologies
Rational Rose IBM Corporation
Power Designer Sybase Corporation
Oracle Designer Oracle Corporation


Name some of modeling tools available in the Market?

These tools are used for Data/dimension modeling 

1. Oracle Designer
2. ERWin (Entity Relationship for windows)
3. Informatica (Cubes/Dimensions)
4. Embarcadero
5. Power Designer Sybase


How do you load the time dimension?
Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. It is not unusual for 100 years to be represented in a time dimension, with one row per day.

Explain the advantages of RAID 1, 1/0, and 5. What type of RAID setup would you put your TX logs.

Transaction logs write sequentially and don't need to be read at all. The ideal is to have each on RAID 1/0 because it has much better write performance than RAID 5. 


RAID 1 is also better for TX logs and costs less than 1/0 to implement. It has a tad less reliability and performance is a little worse generally speaking. 

RAID 5 is best for data generally because of cost and the fact it provides great read capability.


What are the various ETL tools in the Market?

Various ETL tools used in market are: 


1. Informatica
2. Data Stage
3. MS-SQL DTS (Integrated Services 2005)
4. Abinitio
5. SQL Loader
6. Sunopsis
7. Oracle Warehouse Builder
8. Data Junction


What is VLDB?

Answer 1:

VLDB stands for Very Large Database. 

It is an environment or storage space managed by a relational database management system (RDBMS) consisting of vast quantities of information. 

Answer 2:
VLDB doesn’t refer to size of database or vast amount of information stored. It refers to the window of opportunity to take back up the database. 

Window of opportunity refers to the time of interval and if the DBA was unable to take back up in the specified time then the database was considered as VLDB.


What are Data Marts?
A data mart is a focused subset of a data warehouse that deals with a single area(like different department) of data and is organized for quick analysis

What are the steps to build the datawarehosue?

Gathering business requirements

Identifying Sources
Identifying Facts
Defining Dimensions
Define Attributes
Redefine Dimensions & Attributes
Organize Attribute Hierarchy & Define Relationship
Assign Unique Identifiers
Additional conventions: Cardinality/Adding ratios


What is Difference between E-R Modeling and Dimensional Modeling.?

Basic diff is E-R modeling will have logical and physical model. Dimensional model will have only physical model. 


E-R modeling is used for normalizing the OLTP database design. 

Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.


Why fact table is in normal form?

Basically the fact table consists of the Index keys of the dimension/ook up tables and the measures. 


so when ever we have the keys in a table .that itself implies that the table is in the normal form.


What is the advantages data mining over traditional approaches?

Data Mining is used for the estimation of future. For example, if we take a company/business organization, by using the concept of Data Mining, we can predict the future of business interims of Revenue (or) Employees (or) Customers (or) Orders etc. 


Traditional approaches use simple algorithms for estimating the future. But, it does not give accurate results when compared to Data Mining.


What are the various ETL tools in the Market?

Various ETL tools used in market are: 

Informatica
Data Stage
Oracle Warehouse Builder
Ab Initio
Data Junction


What is a CUBE in Datawarehousing concept?
Cubes are logical representation of multidimensional data. The edge of the cube contains dimension members and the body of the cube contains data values.

What are data validation strategies for data mart validation after loading process?

Data validation is to make sure that the loaded data is accurate and meets the business requirements. 


Strategies are different methods followed to meet the validation requirements


What is the data type of the surrogate key ?
Data type of the surrogate key is either integer or numeric or number

What is degenerate dimension table?
Degenerate Dimensions: If a table contains the values, which r neither dimension nor measures is called degenerate dimensions. Ex: invoice id, empno

What is Dimensional Modeling?
Dimensional Modeling is a design concept used by many data warehouse designers to build their datawarehosue. In this design model all the data is stored in two types of tables - Facts table and Dimension table. Fact table contains the facts/measurements of the business and the dimension table contains the context of measurements i.e., the dimensions on which the facts are calculated.

What are the methodologies of Data Warehousing?
Every company has methodology of their own. But to name a few SDLC Methodology, AIM methodology are standard used. Other methodologies are AMM, World class methodology and many more.

What is a linked cube?
Linked cube in which a sub-set of the data can be analyzed into great detail. The linking ensures that the data in the cubes remain consistent.
What is the main difference between Inmon and Kimball philosophies of data warehousing?

Both differed in the concept of building the datawarehosue.. 


According to Kimball ... 

Kimball views data warehousing as a constituency of Data marts. Data marts are focused on delivering business objectives for departments in the organization. And the data warehouse is a conformed dimension of the data marts. Hence a unified view of the enterprise can be obtained from the dimension modeling on a local departmental level. 

Inmon beliefs in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data warehouse can start with data from the online store. Other subject areas can be added to the data warehouse as their needs arise. Point-of-sale (POS) data can be added later if management decides it is necessary. 

i.e., 
Kimball--First DataMarts--Combined way ---Datawarehosue 

Inmon---First Datawarehosue--Later----Data marts


What is Data warehousing Hierarchy?

Hierarchies

Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure. 

Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies--one for product categories and one for product suppliers. 

Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse. 

When designing hierarchies, you must consider the relationships in business structures. For example, a divisional multilevel sales organization. 

Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly. 

Levels
A level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or most general level. The levels in a dimension are organized into one or more hierarchies. 

Level Relationships
Level relationships specify top-to-bottom ordering of levels from most general (the root) to most specific information. They define the parent-child relationship between the levels in a hierarchy. 

Hierarchies are also essential components in enabling more complex rewrites. For example, the database can aggregate existing sales revenue on a quarterly base to a yearly aggregation when the dimensional dependencies between quarter and year are known.


What is the main difference between schema in RDBMS and schemas in Datawarehosue....? 

RDBMS Schema

* Used for OLTP systems
* Traditional and old schema
* Normalized
* Difficult to understand and navigate
* cannot solve extract and complex problems
* poorly modeled


DWH Schema
* Used for OLAP systems
* New generation schema
* De Normalized
* Easy to understand and navigate
* Extract and complex problems can be easily solved
* Very good model


What is hybrid slowly changing dimension?

Hybrid SCDs are combination of both SCD 1 and SCD 2. 


It may happen that in a table, some columns are important and we need to track changes for them i.e. capture the historical data for them whereas in some columns even if the data changes, we don't care. 

For such tables we implement Hybrid SCDs, where in some columns are Type 1 and some are Type 2.


What is the different architecture of datawarehosue?

There are two main things 


1. Top down - (bill Inmon)
2.Bottom up - (Ralph kimbol)


1. What is incremental loading? and  What is aggregate fact table?

Incremental loading means loading the ongoing changes in the OLTP. 


Aggregate table contains the [measure] values, aggregated /grouped/summed up to some level of hierarchy.


What is junk dimension? What is the difference between junk dimension and degenerated dimension?

Junk dimension: Grouping of Random flags and text attributes in a dimension and moving them to a separate sub dimension. 


Degenerate Dimension: Keeping the control information on Fact table ex: Consider a Dimension table with fields like order number and order line number and have 1:1 relationship with Fact table, In this case this dimension is removed and the order information will be directly stored in a Fact table in order eliminate unnecessary joins while retrieving order information..


What are the possible data marts in Retail sales?
Product information, sales information

What is the definition of normalized and denormalized view and what are the differences between them?

Normalization is the process of removing redundancies. 


Denormalization is the process of allowing redundancies.


What is meant by metadata in context of a Datawarehosue and how it is important?
Meta data is the data about data; Business Analyst or data modeler usually capture information about data - the source (where and how the data is originated), nature of data (char, varchar, nullable, existence, valid values etc) and behavior of data (how it is modified / derived and the life cycle) in data dictionary a.k.a metadata. Metadata is also presented at the Data mart level, subsets, fact and dimensions, ODS etc. For a DW user, metadata provides vital information for analysis / DSS.

Differences between star and snowflake schemas?

Star schema

A single fact table with N number of Dimension 

Snowflake schema
Any dimensions with extended dimensions are know as snowflake schema


What's the data types present in view? What happens if we implement view in the designer and report?

Three different data types: Dimensions, Measure and Detail. 

View is nothing but an alias and it can be used to resolve the loops in the universe.


Can a dimension table contain numeric values?
Yes. But those data type will be char (only the values can numeric/char)

What is the difference between view and materialized view?

View - store the SQL statement in the database and let you use it as a table. Every time you access the view, the SQL statement executes. 


Materialized view - stores the results of the SQL in table form in the database. SQL statement only executes once and after that every time you run the query, the stored result set is used. Pros include quick query results.


What is surrogate key? Where we use it explain with examples

Surrogate key is a substitution for the natural primary key. 

It is just a unique identifier or number for each row that can be used for the primary key to the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. 

Data warehouses typically use a surrogate, (also known as artificial or identity key), key for the dimension tables primary keys. They can use Infa sequence generator, or Oracle sequence, or SQL Server Identity values for the surrogate key. 

It is useful because the natural primary key (i.e. Customer Number in Customer table) can change and this makes updates more difficult. 

Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users) but ,not only can these change, indexing on a numerical value is probably better and you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system and as far as the client is concerned you may display only the AIRPORT_NAME. 

2. Adapted from response by Vincent on Thursday, March 13, 2003 
Another benefit you can get from surrogate keys (SID) is : 
Tracking the SCD - Slowly Changing Dimension. 

Let me give you a simple, classical example: 
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee Dimension). This employee has a turnover allocated to him on the Business Unit 'BU1' but on the 2nd of June the Employee 'E1' is muted from Business Unit 'BU1' to Business Unit 'BU2.' All the new turnover has to belong to the new Business Unit 'BU2' but the old one should Belong to the Business Unit 'BU1.' 

If you used the natural business key 'E1' for your employee within your datawarehosue everything would be allocated to Business Unit 'BU2' even what actually belongs to 'BU1.' 

If you use surrogate keys, you could create on the 2nd of June a new record for the Employee 'E1' in your Employee Dimension with a new surrogate key. 

This way, in your fact table, you have your old data (before 2nd of June) with the SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee 'E1' + 'BU2.' 

You could consider Slowly Changing Dimension as an enlargement of your natural key: natural key of the Employee was Employee Code 'E1' but for you it becomes

Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' But the difference with the natural key enlargement process is that you might not have all part of your new key within your fact table, so you might not be able to do the join on the new enlarge key -> so you need another id.


What is ER Diagram?

The Entity-Relationship (ER) model was originally proposed by Peter in 1976 [Chen76] as a way to unify the network and relational database views. 


Simply stated the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram which is used to visually represent data objects. 

Since Chen wrote his paper the model has been extended and today it is commonly used for database design for the database designer, the utility of the ER model is: 

it maps well to the relational model. The constructs used in the ER model can easily be transformed into relational tables. it is simple and easy to understand with a minimum of training. Therefore, the model can be used by the database designer to communicate the design to the end user. 

In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.


What is aggregate table and aggregate fact table ... any examples of both?

Aggregate table contains summarized data. The materialized views are aggregated tables. 


For ex in sales we have only date transaction. If we want to create a report like sales by product per year. In such cases we aggregate the date vales into week_agg, month_agg, quarter_agg, year_agg. To retrieve date from these tables we use @aggregate function.


What is active data warehousing?
An active data warehouse provides information that enables decision-makers within an organization to manage customer relationships nimbly, efficiently and proactively. Active data warehousing is all about integrating advanced decision support with day-to-day-even minute-to-minute-decision making in a way that increases quality of those customer touches which encourages customer loyalty and thus secure an organization's bottom line. The marketplace is coming of age as we progress from first-generation "passive" decision-support systems to current- and next-generation "active" data warehouse implementations

Why do we override the execute method is struts? Please give me the details?

As part of Struts Framework we can develop the Action Servlet, ActionForm Servlet(here ActionServlet means which class extends the Action class is called ActionServlet and ActionFome means which class extends the ActionForm class is called the Action Form Servlet) and other servlets classes. 


In case of ActionForm class we can develop the validate (). This method will return the ActionErrors object. In this method we can write the validation code. If this method return null or ActionErrors with size=0, the web container will call the execute() as part of the Action class. If it returns size > 0 it will not be call the execute (). It will execute the jsp, Servlet or html file as value for the input attribute as part of the attribute in struts-config.xml file.


What is the difference between Datawarehousing and Business Intelligence?
Datawarehousing deals with all aspects of managing the development, implementation and operation of a data warehouse or data mart including Meta data management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management, backup/recovery planning, etc. Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of their business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions, etc. 
Typically, the term “business intelligence” is used to encompass OLAP, data visualization, data mining and query/reporting tools. Think of the data warehouse as the back office and business intelligence as the entire business including the back office. The business needs the back office on which to function, but the back office without a business to support, makes no sense.

What is the difference between OLAP and datawarehosue?

Data warehouse is the place where the data is stored for analyzing 

where as OLAP is the process of analyzing the data, managing aggregations, 
partitioning information into cubes for in-depth visualization.


What is fact less fact table? Where you have used it in your project?
Factless table means only the key available in the Fact there is no measures available.

Why Denormalization is promoted in Universe Designing?
In a relational data model, for normalization purposes, some lookup tables are not merged as a single table. In a dimensional data modeling (star schema), these tables would be merged as a single table called DIMENSION table for performance and slicing data. Due to this merging of tables into one large Dimension table, it comes out of complex intermediate joins. Dimension tables are directly joined to Fact tables. Though, redundancy of data occurs in DIMENSION table, size of DIMENSION table is 15% only when compared to FACT table. So only Denormalization is promoted in Universe Designing.

What is the difference between ODS and OLTP?

ODS:- It is nothing but a collection of tables created in the Data warehouse that maintains only current data 


where as OLTP maintains the data only for transactions, these are designed for recording daily operations and transactions of a business


What is the difference between data warehouse and BI?
Simply speaking, BI is the capability of analyzing the data of a data warehouse in advantage of that business. A BI tool analyzes the data of a data warehouse and to come into some business decision depending on the result of the analysis.

Is OLAP databases are called decision support system??? True/false?
True

Explain in detail about type 1, type 2(SCD), type 3?

Type-1 

Most Recent Value 
Type-2(full History) 
i) Version Number 
ii) Flag 
iii) Date 
Type-3 
Current and one Perivies value


What is snapshot?
You can disconnect the report from the catalog to which it is attached by saving the report with a snapshot of the data. However, you must reconnect to the catalog if you want to refresh the data.

What is the difference between data warehouse and BI?
Simply speaking, BI is the capability of analyzing the data of a data warehouse in advantage of that business. A BI tool analyzes the data of a data warehouse and to come into some business decision depending on the result of the analysis.

What are non-additive facts in detail?


A fact may be measure, metric or a dollar value. Measure and metric are non additive facts. 


Dollar value is additive fact. If we want to find out the amount for a particular place for a particular period of time, we can add the dollar amounts and come up with the total amount. 


A non additive fact, for e.g. measure height(s) for 'citizens by geographical location' , when we rollup 'city' data to 'state' level data we should not add heights of the citizens rather we may want to use it to derive 'count' .

1 comment:

What is Oracle Database Link and How to create database link (DB Link)

 https://www.oracletutorial.com/oracle-administration/oracle-create-database-link/

Recent Posts