Data warehouse solutions are typically implemented to provide a central source of reliable data within an organisation. They normally consist of four basic components:
1. Data Warehouse
2. ETL Tool
3. Data Cleansing Tool
4. Reporting/Analysis Tool
Often pulling together data from multiple sources, a data warehouse aims to provide a single version of the truth. All too often, the same data is held in multiple databases, each with its own inaccuracies: your finance department will invariably produce a report showing x sales last month, your marketing team will produce a similar report showing a different figure, and manufacturing will show yet another.
Before you implement a data warehouse solution, you must be clear on its goals. What is your data warehouse trying to achieve? Once you have these goals written down, you can begin to piece together the component parts of the warehouse.
Data Warehouse Design
Dimensional models are based on a series of inter-related fact and dimension tables. These are often referred to as star schemas because of the arrangement of a single fact table with a series of dimension tables joined to it in the form of a star. A related model is the snowflake schema, a more normalised version of the star schema in which the dimension tables are broken out into multiple related tables.
Fact tables typically store measurement data from a particular business process, e.g. sales. Facts are normally numeric and additive. This is important because aggregate tables can then be created simply by summing those additive measures.
Dimension tables contain the textual, descriptive information for the business, for example customer names, product names, etc.
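As a rough sketch, a simple sales star schema could be created in SQLite along the lines below. The table and column names are purely illustrative, not taken from any particular warehouse:

```python
import sqlite3

# A sketch of a sales star schema; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables hold the descriptive, textual attributes.
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT,
        region        TEXT
    );
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,  -- e.g. 20240131
        full_date TEXT,
        month     TEXT,
        year      INTEGER
    );

    -- The fact table sits at the centre of the star: numeric, additive
    -- measures plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        quantity     INTEGER,
        sales_amount REAL
    );
""")
```

Because quantity and sales_amount are additive, an aggregate version of this fact table (for example, sales by month and category) can be derived simply by summing them.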
Once these models are identified, a data warehouse can be built using a variety of data warehousing tools.
The Data Warehouse Load Process
All data that populates a data warehouse is normally sourced from operational or transactional source systems. Data is pulled from these systems into a data staging area using data warehousing (ETL) software, in a step known as the extract process.
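As an illustration of the extract step, the sketch below pulls the previous day's orders from a hypothetical operational database and lands them, untouched, in the staging area as a CSV file. The database file, table and column names are assumptions made for the example:

```python
import csv
import sqlite3
from datetime import date, timedelta

# Extract sketch: pull yesterday's orders from a (hypothetical) operational
# database and land them, untouched, in the staging area as a CSV file.
source = sqlite3.connect("operational_orders.db")
yesterday = (date.today() - timedelta(days=1)).isoformat()
rows = source.execute(
    "SELECT order_id, customer_id, product_id, quantity, amount, order_date "
    "FROM orders WHERE order_date = ?",
    (yesterday,),
).fetchall()

with open("staging/orders_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "customer_id", "product_id",
                     "quantity", "amount", "order_date"])
    writer.writerows(rows)
```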
Once in the data staging area, the data is cleansed, combined with data from other sources, and, where possible, missing values are filled in. The transformed data is then loaded into the data presentation area, structured as a dimensional model.
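A minimal sketch of this staging-area work, using pandas and continuing the hypothetical files from the extract example, might look like this:

```python
import pandas as pd

# Transform sketch: cleanse the extracted orders, combine them with customer
# data from another source, and fill in missing values. File and column
# names continue the hypothetical example above.
orders = pd.read_csv("staging/orders_extract.csv")
customers = pd.read_csv("staging/crm_customers.csv")

orders = orders.drop_duplicates(subset="order_id")  # remove duplicate orders
orders["quantity"] = orders["quantity"].fillna(1)   # default missing quantities
orders["amount"] = orders["amount"].fillna(0.0)     # default missing amounts

# Combine with customer reference data so the load step can populate the
# customer dimension as well as the fact table.
staged = orders.merge(customers, on="customer_id", how="left")
staged.to_csv("staging/orders_clean.csv", index=False)
```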
The data warehouse is then populated, normally using an ETL (Extract, Transform and Load) tool, although this can also be done with hand-written code. Data cleansing tools are used in the data staging area before the resulting data is moved into the presentation area.
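For illustration, a hand-written load step might look like the sketch below, which moves the cleansed staging file into the fact table of the star schema shown earlier. It assumes the staging file already carries the resolved dimension surrogate keys; a real ETL tool would handle those lookups (and slowly changing dimensions) for you:

```python
import csv
import sqlite3

# Load sketch: insert the cleansed staging file into the fact table of the
# presentation-area star schema. Assumes the staging file already carries
# the resolved dimension surrogate keys.
warehouse = sqlite3.connect("warehouse.db")
with open("staging/orders_clean.csv", newline="") as f:
    for row in csv.DictReader(f):
        warehouse.execute(
            "INSERT INTO fact_sales "
            "(date_key, customer_key, product_key, quantity, sales_amount) "
            "VALUES (?, ?, ?, ?, ?)",
            (row["date_key"], row["customer_key"], row["product_key"],
             row["quantity"], row["amount"]),
        )
warehouse.commit()
```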
Once the data is in the presentation area, reporting tools can query it either directly or via a metalayer to display the data in your reports. An example of a metalayer is a Universe from Business Objects.
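Whether issued directly or generated by a metalayer, a typical report boils down to a query that joins the fact table to its dimensions, as in this illustrative sketch:

```python
import sqlite3

# Reporting sketch: monthly sales by product category, queried straight from
# the presentation area. A reporting tool or metalayer would typically
# generate SQL much like this on the user's behalf.
warehouse = sqlite3.connect("warehouse.db")
report = warehouse.execute("""
    SELECT d.year, d.month, p.category,
           SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date    d ON d.date_key    = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, d.month, p.category
    ORDER BY d.year, d.month, total_sales DESC
""").fetchall()

for year, month, category, total in report:
    print(year, month, category, total)
```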
Additionally, OLAP cubes can be created from data held in the presentation area for reporting. These are highly aggregated ‘mini-warehouses’, often created to answer specific questions about a particular area of the business.
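As a rough illustration, a small cube of sales by month and region could be built from the presentation area like this, with pandas standing in for a dedicated OLAP engine:

```python
import pandas as pd
import sqlite3

# Cube sketch: a small, highly aggregated slice of the warehouse, with sales
# summed by month and customer region. pandas stands in for an OLAP engine.
warehouse = sqlite3.connect("warehouse.db")
detail = pd.read_sql_query("""
    SELECT d.month, c.region, f.sales_amount
    FROM fact_sales f
    JOIN dim_date     d ON d.date_key     = f.date_key
    JOIN dim_customer c ON c.customer_key = f.customer_key
""", warehouse)

cube = detail.pivot_table(
    index="month", columns="region",
    values="sales_amount", aggfunc="sum", fill_value=0,
)
print(cube)
```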
Data warehouses are normally updated overnight due to the volumes of data they need to process. However, it is feasible to make them real-time warehouses.
To summarise, data warehouse solutions normally consist of a database built on a dimensional model of facts and dimensions, populated from one or more operational data sources using an ETL tool, cleansed using a data cleansing tool, and loaded into a presentation layer. Reporting and analysis tools complete the data warehouse solution by presenting the information to users as reports and dashboards.