
Sunday, July 29, 2012

Type 2 Slowly Changing Dimension


In a Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.
In our example, recall we originally have the following table:
Customer Key | Name | State
1001 | Christina | Illinois
After Christina moved from Illinois to California, we add the new information as a new row into the table:
Customer Key | Name | State
1001 | Christina | Illinois
1005 | Christina | California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow quickly. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- It also complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
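As a rough sketch of how the Type 2 change could be applied in SQL (the dimension table name CUSTOMER_DIM and the hard-coded key 1005 are only illustrative; in practice the surrogate key would come from a sequence such as a Sequence Generator):

    -- Type 2: keep the original row and add a new row with its own surrogate key
    INSERT INTO CUSTOMER_DIM (CUSTOMER_KEY, NAME, STATE)
    VALUES (1005, 'Christina', 'California');
    -- The existing row (1001, 'Christina', 'Illinois') is left untouched, preserving history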

Type 1 Slowly Changing Dimension


In a Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key | Name | State
1001 | Christina | Illinois
After Christina moved from Illinois to California, the new information overwrites the original record, and we have the following table:
Customer Key | Name | State
1001 | Christina | California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
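As a rough sketch of how the Type 1 overwrite could be applied in SQL (the dimension table name CUSTOMER_DIM is only illustrative):

    -- Type 1: overwrite the existing row in place; no history is kept
    UPDATE CUSTOMER_DIM
    SET STATE = 'California'
    WHERE CUSTOMER_KEY = 1001;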

Saturday, July 28, 2012

Difference between Data cleansing and Data scrubbing?


Data cleansing is the process of removing errors and resolving inconsistencies in source data before loading the data into targets.
Data scrubbing is the process of filtering, merging, decoding and translating the source data to create validated data for the data warehouse.
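As a small, hypothetical example of a cleansing/scrubbing step in SQL (the staging table STG_CUSTOMER and its columns are assumed for illustration), stray blanks can be trimmed and coded values decoded before the load:

    -- Trim blanks and decode a state code before loading the warehouse target
    SELECT CUSTOMER_ID,
           TRIM(NAME) AS NAME,
           CASE UPPER(TRIM(STATE_CODE))
                WHEN 'IL' THEN 'Illinois'
                WHEN 'CA' THEN 'California'
                ELSE 'UNKNOWN'
           END AS STATE
    FROM STG_CUSTOMER;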

Friday, July 13, 2012

Mappings



The naming convention for mappings which relate to specific systems is 'm_' followed by the system and a description of the mapping's function.
Example: m_trex_rps_create_transaction_stf_format

NAMING CONVENTIONS FOR TRANSFORMATIONS






Sequence Generator Transformation: 'seq_' + table name. Example: seq_DIM_PRODUCT
Source Qualifier Transformation: 'sq_' + table name. If the source qualifier is defined on table DIM_PRODUCT then the transform should be named sq_DIM_PRODUCT
Stored Procedure Transformation: 'sp_' + stored procedure name. The stored procedure (sp) transform name should be a concatenation of 'sp_' + the name of the stored procedure being called. NOTE: If a stored procedure name already begins with 'sp' then there is no need to prefix it with 'sp_'.
Update Strategy Transformation: 'upd_' + table name for which the update strategy is being defined. If the update strategy is being defined for table DIM_CUSTOMER then the transform should be named upd_DIM_CUSTOMER
Union Transformation: 'un_' + description_of_merge. Example: un_merge_transaction_records
Router Transformation: 'rtr_' + description_of_routing. Example: rtr_route_wrap_error_valid_records


Ports
Naming convention to be followed:
Input Ports: 'i_' if the port is created explicitly
Output Ports: 'o_' if the port is created explicitly
Variable Ports: 'v_'

Workflows


Workflows should begin with 'wf_' and represent the functionality of the workflow.

The naming convention for workflows is 'wf_' followed by the system and a description of the workflow's function.

Example: wf_trex_rps_create_transaction_stf_format

Mapplets



The naming convention for mapplets is: mplt_Description
Mapplet input: input_SourceName
Mapplet output: output_TargetName


Example: mplt_acr_report_generation

For Variables


The naming convention for variables is: 'v_' + variable name




1. A variable name should be easy to read and convey the purpose of the variable. This helps when a number of variables are defined in a single transformation.
2. Variable names should follow the sentence case format, i.e. the first letter of the name will be in upper case.
3. The use of an underscore is recommended as part of a variable name for clarity and readability.



Example: The variable for first quarter sales should be named Q1_sales instead of Q1sales.

 

Variable name | Variable Expression | Description
v_Month | GET_DATE_PART(DATE_ENTERED, 'mm') | Extract the month from DATE_ENTERED
v_Q1_sales | SUM(QUANTITY * PRICE - DISCOUNT, Month = 1 OR Month = 2 OR Month = 3) | Calculate Q1 sales
v_State_counter | IIF(PREVIOUS_STATE = STATE, State_counter + 1, 1) | Increment v_State_counter if STATE in the previous data record is the same as in the current data record

Sessions

The name of a session is derived from the mapping it is based upon. By default, the session name generated by Informatica will be 's_' + mapping name. The default name should be used as the session name. For example, a session based on the mapping m_trex_gsr_dim_geography will be named s_m_trex_gsr_dim_geography.


The naming convention for sessions is: 's_' + mapping name




If one mapping is being used by two sessions then the session name should be suffixed suitably to indicate the need for using the same mapping for two or more sessions.


Example: s_m_trex_rps_create_transaction_stf_format

Parameter files


Parameter file names should be prefixed with 'Prm_' and use the .prm extension.
Example: Prm_RPS_010_job.prm

Log and Bad files



The naming conventions for log and bad files are:
Log files: the .log extension
Bad files: the .bad extension

Server Variables


The naming conventions for server variables are:
$PMRootDir = the root directory (Example: \\Server\C:\Program Files\Informatica PowerCenter 7.1\Server\)
$PMRootDir/SessLogs = Session Logs directory
$PMRootDir/BadFiles = Bad Files directory
$PMRootDir/Cache = Cache Files directory
$PMRootDir/TgtFiles = Target Flat Files directory (output)
$PMRootDir/SrcFiles = Source Flat Files directory (input)
$PMRootDir/ExtProc = External Procedures directory
$PMRootDir/LkpFiles = Lookup Files directory
$PMRootDir/WorkflowLogs = Workflow Logs directory

Performance improvement related standards


1. Turn off verbose logging.
2. Turn off 'Collect performance statistics' after performance testing.
3. Try not to read a file over the network.
4. In places where an opportunity for code reuse is not present, consider output expressions in place of variables.
5. Using a Sequence Generator provides an edge in performance over a stored procedure call to get the sequence from the database.
6. An Update Strategy slows down the performance of the session.
7. Lookups and aggregators slow down the performance since they involve caching. It is advisable to calculate and check the index and data cache sizes when using them. Consider partitioning of files wherever necessary.

Miscellaneous standards



1. All ports are to be in CAPITAL LETTERS for flat files.
2. The description of all transformations (including sources and targets) is to be filled in appropriately.
3. There should be no unused ports in any transformation.
4. The connection information of a lookup should be $SOURCE or $TARGET.
5. For output-only ports, all 'Error (Transformation error)' messages must be removed.
6. When strings are read from a fixed-width flat file source and compared, ensure that the strings are trimmed properly (LTRIM, RTRIM) for blank spaces.
7. Wherever the fields returned from a lookup transformation are used, null values for these fields should be handled accordingly.
8. When ports are dragged to successive transformations, the names should not be changed.
9. Common sources and target tables are to be used from the 'COMMON' folder.
10. By default, the 'edit null characters to space' option in the session definition for file targets has to be selected.
11. Check the return status of pmcmd. If non-zero then exit.
12. ALL CONTROL REPORTS TO BE IN CAPS - please modify the necessary scripts. Example:
RECORDS PROCESSED: 45
RECORDS REJECTED: 0

Note there is a ":" separating the description and the value in the above example. This should be the standard for all control reports.

13. ALL CONTROL REPORTS TO BE IN CAPS
14. The Default values of all the output ports should be removed.
15. All the components in the mapping should have comments.
16. All ports should be used and connected.

17. The Tracing level value should be set to 'Terse' in all transformations across the mapping.

18. All the 'Error (Transformation error)' messages should be removed from the default values in the Expression transformation.

19. The Mapping should not contain any unused Source/Target Definitions.

20. The Override Tracing should be set to 'TERSE' in the Session properties of Error handling.

21. The 'Enable High Precision' should be disabled in the Session properties of Performance.

22. The 'Stop on error' property should be greater than 0. There may be exceptions to the rule; all such cases should be documented in the design document.

23. The session log file directory, cache directory, file name, database connection strings, reject file directory and target connection string should start with a $.

24. The port names should be as per Standards.

25. The data types of ports in the source qualifier and the source should match.

26. The sequence of fields in the SQL query of a source qualifier should be in the same order as the ports in it.

27. The filter expression should be coded properly in the following format: IIF (condition, True, False).

28. The lookup transformation should not have fields which are neither checked as output ports nor used as lookup-only ports.

29. Usage of variables:
- Variable expressions can reference any input port. So, as a matter of good design principle, variables in a transform should be defined after all the input ports have been defined.
- Variables that will be used by other variable(s) should be the first to be defined.
- Variables can reference other variable(s) in an expression. NOTE: the ordering of variables in a transform is very important. All the variables using other variables in their expression should be defined in the order of the dependency. In the table above, Month is used to calculate Q1_sales. Hence, Q1_sales must always be defined after Month.

Variables are initialized to 0 for numeric variables or empty string for character/date variables.

Variables can be used for temporary storage (for example, PREVIOUS_STATE is a temporary storage variable which will be overwritten when a new data record is read and processed).

Local variables also reduce the overhead of recalculations. If a complex calculation is to be used throughout a mapping then it is recommended to write the expression once and designate it as a variable. The variable can then be passed to other transforms as an input port. This will increase performance, as the Informatica Server will perform the calculation only once. For example, the variable Q1_sales can be passed from one transformation to another rather than redefining the same formula in different transforms.

Variables can remember values across rows, and they retain their value until the next evaluation of the variable expression. Hence, the order of the variables can be used to compute procedural computations. For example, let V1 and V2 be two variables defined with the following expressions, where V1 occurs before V2: V1 has the expression 'V2 + 1' and V2 has the expression 'V1 + 1'. In this case, V1 gets the previous row's value of V2, since V2 occurs after V1 and is evaluated after V1. But V2 gets the current row's value of V1, since V1 has already been evaluated by the time V2 is evaluated.


30. Informatica can perform data sorting based on a key defined for a source, but it is beneficial to have an index (matching the sort criteria) on the table for better performance. This is especially helpful when performing key-based aggregations as well as lookups.
31. An external procedure/function call should always return a value. The return value will be used to determine the next logical step in the mapping execution.
32. The performance of an aggregator can be improved by presorting the data before passing it to the aggregator.
33. Filter data as early as possible in the ETL.
34. Use custom SQL in the Source Qualifier for the WHERE clause to filter data. This lets the database do the work, where it can be handled most efficiently (a sketch of such an override follows this list).
35. A stored procedure should always return a value. The return value will be used by Informatica to evaluate the success or failure of the event. It can also be used to control the outcome of the mapping execution.
36. Stored procedures used inside a mapping usually create poor performance. Therefore, it is recommended to avoid using stored procedures inside a mapping.
37. Mapping a string to an integer, or an integer to a string, will perform the conversion, but it will be slower than creating an output port with an expression like TO_INTEGER(xxxx) and mapping an integer to an integer. This is because PMServer is left to decide whether the conversion can be done mid-stream, which slows things down.
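
As an illustration of points 32, 33 and 34, the SQL override in a Source Qualifier can push both the filter and the sort down to the database. This is only a sketch: the table SRC_TRANSACTIONS, its columns and the Oracle-style TO_DATE call are assumed for the example and are not part of the standard.

    -- Illustrative Source Qualifier SQL override (table and column names assumed)
    SELECT TRANSACTION_ID, CUSTOMER_KEY, QUANTITY, PRICE, DATE_ENTERED
    FROM SRC_TRANSACTIONS
    WHERE DATE_ENTERED >= TO_DATE('01/01/2012', 'MM/DD/YYYY')  -- filter early, in the database (points 33 and 34)
    ORDER BY CUSTOMER_KEY;                                     -- presort on the aggregator's group-by key (point 32)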