
Sunday, July 29, 2012

Type 2 Slowly Changing Dimension


In a Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.
In our example, recall we originally have the following table:
Customer Key | Name | State
1001 | Christina | Illinois
After Christina moved from Illinois to California, we add the new information as a new row into the table:
Customer Key | Name | State
1001 | Christina | Illinois
1005 | Christina | California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow quickly. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- It also complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
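As a rough sketch of how the Type 2 change could be applied in SQL (the dimension table name CUSTOMER_DIM and the hard-coded key 1005 are only illustrative; in practice the surrogate key would come from a sequence such as a Sequence Generator):

    -- Type 2: keep the original row and add a new row with its own surrogate key
    INSERT INTO CUSTOMER_DIM (CUSTOMER_KEY, NAME, STATE)
    VALUES (1005, 'Christina', 'California');
    -- The existing row (1001, 'Christina', 'Illinois') is left untouched, preserving history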

Type 1 Slowly Changing Dimension


In a Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key | Name | State
1001 | Christina | Illinois
After Christina moved from Illinois to California, the new information overwrites the original record, and we have the following table:
Customer Key | Name | State
1001 | Christina | California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
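As a rough sketch of how the Type 1 overwrite could be applied in SQL (the dimension table name CUSTOMER_DIM is only illustrative):

    -- Type 1: overwrite the existing row in place; no history is kept
    UPDATE CUSTOMER_DIM
    SET STATE = 'California'
    WHERE CUSTOMER_KEY = 1001;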

Saturday, July 28, 2012

Difference between Data cleansing and Data scrubbing?


Data cleansing is the process of removing errors and resolving inconsistencies in source data before loading the data into targets.
Data scrubbing is the process of filtering, merging, decoding and translating the source data to create validated data for the data warehouse.
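As a small, hypothetical example of a cleansing/scrubbing step in SQL (the staging table STG_CUSTOMER and its columns are assumed for illustration), stray blanks can be trimmed and coded values decoded before the load:

    -- Trim blanks and decode a state code before loading the warehouse target
    SELECT CUSTOMER_ID,
           TRIM(NAME) AS NAME,
           CASE UPPER(TRIM(STATE_CODE))
                WHEN 'IL' THEN 'Illinois'
                WHEN 'CA' THEN 'California'
                ELSE 'UNKNOWN'
           END AS STATE
    FROM STG_CUSTOMER;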

Friday, July 13, 2012

Mappings



The naming convention for mappings which relate to specific systems is 'm_' followed by the system and a description of the mapping's function.
Example: m_trex_rps_create_transaction_stf_format

NAMING CONVENTIONS FOR TRANSFORMATIONS






Sequence Generator Transformation: 'seq_' + table name. Example: seq_DIM_PRODUCT
Source Qualifier Transformation: 'sq_' + table name. If the source qualifier is defined on table DIM_PRODUCT then the transform should be named sq_DIM_PRODUCT
Stored Procedure Transformation: 'sp_' + stored procedure name. The stored procedure (sp) transform name should be a concatenation of 'sp_' + the name of the stored procedure being called. NOTE: If a stored procedure name already begins with 'sp' then there is no need to prefix it with 'sp_'.
Update Strategy Transformation: 'upd_' + table name for which the update strategy is being defined. If the update strategy is being defined for table DIM_CUSTOMER then the transform should be named upd_DIM_CUSTOMER
Union Transformation: 'un_' + description_of_merge. Example: un_merge_transaction_records
Router Transformation: 'rtr_' + description_of_routing. Example: rtr_route_wrap_error_valid_records


Ports
Naming convention to be followed:
Input Ports: 'i_' if the port is created explicitly
Output Ports: 'o_' if the port is created explicitly
Variable Ports: 'v_'

Workflows


Workflows should begin with 'wf_' and represent the functionality of the workflow.

The naming convention for workflows is 'wf_' followed by the system and a description of the workflow's function.

Example: wf_trex_rps_create_transaction_stf_format

Mapplets



The naming convention for mapplets is: mplt_Description
Mapplet input: input_SourceName
Mapplet output: output_TargetName


Example: mplt_acr_report_generation

For Variables


The naming convention for variables is: 'v_' + variable name




1. A variable name should be easy to read and convey the purpose of the variable. This helps when a number of variables are defined in a single transformation.
2. Variable names should follow the sentence case format, i.e. the first letter of the name will be in upper case.
3. The use of an underscore is recommended as part of a variable name for clarity and readability.



Example: The variable for first quarter sales should be named Q1_sales instead of Q1sales.

 

Variable name | Variable Expression | Description
v_Month | GET_DATE_PART(DATE_ENTERED, 'mm') | Extract the month from DATE_ENTERED
v_Q1_sales | SUM(QUANTITY * PRICE - DISCOUNT, Month = 1 OR Month = 2 OR Month = 3) | Calculate Q1 sales
v_State_counter | IIF(PREVIOUS_STATE = STATE, State_counter + 1, 1) | Increment v_State_counter if STATE in the previous data record is the same as in the current data record

Sessions

The name of a session is derived from the mapping it is based upon. By default, the session name generated by Informatica will be 's_' + mapping name. The default name should be used as the session name. For example, a session based on the mapping m_trex_gsr_dim_geography will be named s_m_trex_gsr_dim_geography.


The naming convention for sessions is: 's_' + mapping name




If one mapping is being used by two sessions then the session name should be suffixed suitably to indicate the need for using the same mapping for two or more sessions.


Example: s_m_trex_rps_create_transaction_stf_format

Parameter files


Parameter file names should be prefixed with 'Prm_' and use the .prm extension.
Example: Prm_RPS_010_job.prm

Log and Bad files



The naming conventions for log and bad files are:
Log files: the .log extension
Bad files: the .bad extension

Server Variables


The naming conventions for server variables are:
$PMRootDir = the root directory (Example: \\Server\C:\Program Files\Informatica PowerCenter 7.1\Server\)
$PMRootDir/SessLogs = Session Logs directory
$PMRootDir/BadFiles = Bad Files directory
$PMRootDir/Cache = Cache Files directory
$PMRootDir/TgtFiles = Target Flat Files directory (output)
$PMRootDir/SrcFiles = Source Flat Files directory (input)
$PMRootDir/ExtProc = External Procedures directory
$PMRootDir/LkpFiles = Lookup Files directory
$PMRootDir/WorkflowLogs = Workflow Logs directory

Performance improvement related standards


1. Turn off verbose logging.
2. Turn off 'Collect performance statistics' after performance testing.
3. Try not to read a file over the network.
4. In places where an opportunity for code reuse is not present, consider output expressions in place of variables.
5. Using a Sequence Generator provides an edge in performance over a stored procedure call to get the sequence from the database.
6. An Update Strategy slows down the performance of the session.
7. Lookups and aggregators slow down the performance since they involve caching. It is advisable to calculate and check the index and data cache sizes when using them. Consider partitioning of files wherever necessary.

Miscellaneous standards



1. All ports are to be in CAPITAL LETTERS for flat files.
2. The description of all transformations (including sources and targets) is to be filled in appropriately.
3. There should be no unused ports in any transformation.
4. The connection information of a lookup should be $SOURCE or $TARGET.
5. For output-only ports, all 'Error (Transformation error)' messages must be removed.
6. When strings are read from a fixed-width flat file source and compared, ensure that the strings are trimmed properly (LTRIM, RTRIM) for blank spaces.
7. Wherever the fields returned from a lookup transformation are used, null values for these fields should be handled accordingly.
8. When ports are dragged to successive transformations, the names should not be changed.
9. Common sources and target tables are to be used from the 'COMMON' folder.
10. By default, the 'edit null characters to space' option in the session definition for file targets has to be selected.
11. Check the return status of pmcmd. If non-zero then exit.
12. ALL CONTROL REPORTS TO BE IN CAPS - please modify the necessary scripts. Example:
RECORDS PROCESSED: 45
RECORDS REJECTED: 0

Note there is a ":" separating the description and the value in the above example. This should be the standard for all control reports.

13. ALL CONTROL REPORTS TO BE IN CAPS
14. The Default values of all the output ports should be removed.
15. All the components in the mapping should have comments.
16. All ports should be used and connected.

17. The Tracing level value should be set to 'Terse' in all transformations across the mapping.

18. All the 'Error (Transformation error)' messages should be removed from the default values in the Expression transformation.

19. The Mapping should not contain any unused Source/Target Definitions.

20. The Override Tracing should be set to 'TERSE' in the Session properties of Error handling.

21. The 'Enable High Precision' should be disabled in the Session properties of Performance.

22. The 'Stop on error' property should be greater than 0. There may be exceptions to the rule; all such cases should be documented in the design document.

23. The session log file directory, cache directory, file name, database connection strings, reject file directory and target connection string should start with a $.

24. The port names should be as per Standards.

25. The data types of ports in the source qualifier and the source should match.

26. The sequence of fields in the SQL query of a source qualifier should be in the same order as the ports in it.

27. The filter expression should be coded properly in the following format: IIF (condition, True, False).

28. The lookup transformation should not have fields which are neither checked as output ports nor used as lookup-only ports.

29. Usage of variables:
- Variable expressions can reference any input port. So, as a matter of good design principle, variables in a transform should be defined after all the input ports have been defined.
- Variables that will be used by other variable(s) should be the first to be defined.
- Variables can reference other variable(s) in an expression. NOTE: the ordering of variables in a transform is very important. All the variables using other variables in their expression should be defined in the order of the dependency. In the table above, Month is used to calculate Q1_sales. Hence, Q1_sales must always be defined after Month.

Variables are initialized to 0 for numeric variables or empty string for character/date variables.

Variables can be used for temporary storage (for example, PREVIOUS_STATE is a temporary storage variable which will be overwritten when a new data record is read and processed).

Local variables also reduce the overhead of recalculations. If a complex calculation is to be used throughout a mapping then it is recommended to write the expression once and designate it as a variable. The variable can then be passed to other transforms as an input port. This will increase performance, as the Informatica Server will perform the calculation only once. For example, the variable Q1_sales can be passed from one transformation to another rather than redefining the same formula in different transforms.

Variables can remember values across rows, and they retain their value until the next evaluation of the variable expression. Hence, the order of the variables can be used to compute procedural computations. For example, let V1 and V2 be two variables defined with the following expressions, where V1 occurs before V2: V1 has the expression 'V2 + 1' and V2 has the expression 'V1 + 1'. In this case, V1 gets the previous row's value of V2, since V2 occurs after V1 and is evaluated after V1. But V2 gets the current row's value of V1, since V1 has already been evaluated by the time V2 is evaluated.


30. Informatica can perform data sorting based on a key defined for a source, but it is beneficial to have an index (matching the sort criteria) on the table for better performance. This is especially helpful when performing key-based aggregations as well as lookups.
31. An external procedure/function call should always return a value. The return value will be used to determine the next logical step in the mapping execution.
32. The performance of an aggregator can be improved by presorting the data before passing it to the aggregator.
33. Filter data as early as possible in the ETL.
34. Use custom SQL in the Source Qualifier for the WHERE clause to filter data. This lets the database do the work, where it can be handled most efficiently (a sketch of such an override follows this list).
35. A stored procedure should always return a value. The return value will be used by Informatica to evaluate the success or failure of the event. It can also be used to control the outcome of the mapping execution.
36. Stored procedures used inside a mapping usually create poor performance. Therefore, it is recommended to avoid using stored procedures inside a mapping.
37. Mapping a string to an integer, or an integer to a string, will perform the conversion, but it will be slower than creating an output port with an expression like TO_INTEGER(xxxx) and mapping an integer to an integer. This is because PMServer is left to decide whether the conversion can be done mid-stream, which slows things down.
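
As an illustration of points 32, 33 and 34, the SQL override in a Source Qualifier can push both the filter and the sort down to the database. This is only a sketch: the table SRC_TRANSACTIONS, its columns and the Oracle-style TO_DATE call are assumed for the example and are not part of the standard.

    -- Illustrative Source Qualifier SQL override (table and column names assumed)
    SELECT TRANSACTION_ID, CUSTOMER_KEY, QUANTITY, PRICE, DATE_ENTERED
    FROM SRC_TRANSACTIONS
    WHERE DATE_ENTERED >= TO_DATE('01/01/2012', 'MM/DD/YYYY')  -- filter early, in the database (points 33 and 34)
    ORDER BY CUSTOMER_KEY;                                     -- presort on the aggregator's group-by key (point 32)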