DataStage Job Context

Ask: I was assigned to build a sample job in IBM DataStage that reads a text file from a NAS path and validates its records. If all validations pass, the data is transformed, written to a new text file, and placed in the same NAS path. If validation fails for even a single record, the failed records are written to a new error text file and an email is sent to the concerned team. This failure scenario can also occur when an empty text file is uploaded. Please guide me in detail on how to create this job in DataStage. The task was given to verify my knowledge of DataStage, so I want the job designed like a real production-grade job, ideally for a BFSI scenario that doesn't involve transactions. I'd also like a high-level, visually appealing, professional workflow diagram for this job.
Below is a real production-style DataStage design that exactly matches your ask:
- Read a text file from a NAS path
- Validate every record
- If all validations pass → transform + write a “good” output file back to NAS
- If even 1 record fails → write failed records to an error file + send email
- Special case: empty file uploaded → treat as failure + email
I’ll frame it as a BFSI non-transaction scenario: Customer Master / KYC enrichment batch (no payments, no transactions).
BFSI scenario (non-transaction) you can present in the interview
Input file: CUSTOMER_KYC_UPDATES_YYYYMMDD.txt dropped to NAS by an upstream system.
Each row contains customer updates used for analytics / compliance reporting:
| Field | Example | Validation idea |
|---|---|---|
| customer_id | 1029384756 | mandatory, numeric, length |
| pan | ABCDE1234F | mandatory, PAN pattern |
| dob | 1996-09-16 | valid date, not in the future, age >= 18 |
| mobile | 9876543210 | 10 digits |
| email | a@b.com | contains @ + domain |
| risk_category | LOW/MED/HIGH | allowed values |
| pincode | 560001 | 6 digits |
Transformations (examples):
- normalize mobile, uppercase PAN
- derive age from dob
- mask PAN for output (ABCDE****F)
- standardize risk_category
Production-grade architecture in DataStage
You’ll build one Sequence job (orchestrator) + one Parallel job (ETL).
1) Sequence job (Orchestrator)
Name example: SEQ_CUST_KYC_FILE_PROCESS
Purpose:
- Detect file arrival + empty file
- Run parallel ETL job
- Based on result (fail count or empty file), send email
- Archive/rename input file safely
Key activities inside the Sequence
Start → Set Variables
- Compute `RUN_TS`, `RUN_ID`, file names (good/error), archive name
Check File Exists
- Execute Command stage with UNIX: `test -f <inputfile>`
Check Empty File
- Execute Command stage with UNIX: `test -s <inputfile>`
- If empty → go to Send Email + create an error marker file (optional)
Run Parallel Job
- `JOB_CUST_KYC_VALIDATE_TRANSFORM`
Decision: any rejects?
- Based on the returned reject count / status
- If rejects > 0 → send email (and keep the reject file)
Archive Input File
- Move to `/nas/archive/` with a timestamp suffix
End
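The archive step can be sketched as a single `mv` with a timestamp suffix. This is a minimal sketch: the directory layout and file name are illustrative placeholders (a temp sandbox is used here so the sketch runs anywhere; in the real job the paths come from the job parameters).

```shell
#!/bin/sh
# Sketch of the archive step: move the processed input file into the
# archive directory with a timestamp suffix so reruns never collide.
# All paths below are illustrative, not fixed by DataStage.
base=$(mktemp -d)
p_in_dir="$base/in";           mkdir -p "$p_in_dir"
p_archive_dir="$base/archive"; mkdir -p "$p_archive_dir"
p_in_filename="CUSTOMER_KYC_UPDATES_20260128.txt"
printf '1029384756|ABCDE1234F\n' > "${p_in_dir}/${p_in_filename}"

RUN_TS=$(date +%Y%m%d_%H%M%S)
mv "${p_in_dir}/${p_in_filename}" \
   "${p_archive_dir}/${p_in_filename}.${RUN_TS}"
```

The timestamp suffix means a re-delivered file with the same name never overwrites an archived copy.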
Email sending options (production-friendly)
- Execute Command stage using `mailx` or `sendmail` (most common on UNIX)
- If your org uses enterprise email: call a shell script that hits an SMTP relay / API.
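A minimal sketch of the notification script the Execute Command activity could call, assuming `mailx` is available on the engine tier. The recipient, subject format, and body fields are illustrative, not mandated by DataStage (the actual send line is commented out so the sketch is runnable anywhere):

```shell
#!/bin/sh
# Sketch: build and (conceptually) send the failure notification.
# All names here (p_mail_to, subject layout, body fields) are illustrative.
p_env="UAT"
p_mail_to="kyc-ops@bank.com"
err_file="/nas/out/customer/error/CUST_KYC_REJECT_20260128_120000.txt"
reject_count=20
in_file="CUSTOMER_KYC_UPDATES_20260128.txt"

subject="[${p_env}][CUST_KYC] File validation failed: ${reject_count} rejects - ${in_file}"
body="Environment : ${p_env}
File        : ${in_file}
Rejects     : ${reject_count}
Reject file : ${err_file}"

echo "$subject"    # printed here instead of sent, to keep the sketch runnable
# printf '%s\n' "$body" | mailx -s "$subject" "$p_mail_to"   # actual send
```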
2) Parallel job (ETL)
Name example: JOB_CUST_KYC_VALIDATE_TRANSFORM
Stages (typical)
Sequential File (Read)
- Reads input from NAS
- Define schema (fixed-width or delimited)
Transformer (Validate + Transform)
Outputs to two links:
- `VALID_OUT`
- `REJECT_OUT`
Sequential File (Write Good Output)
- Writes transformed valid rows to output file on NAS
Sequential File (Write Error Output)
- Writes rejected rows with rejection reason to error file
Row counts / audit
- Capture total/valid/reject counts (for sequence + logs)
How to implement the key requirements (step-by-step)
A) Parameterize everything (must for “production-grade”)
Create a Parameter Set (recommended) or job parameters:
- `p_in_dir=/nas/in/customer/`
- `p_in_filename=#CUSTOMER_KYC_UPDATES#` (or the full name)
- `p_out_dir=/nas/out/customer/`
- `p_err_dir=/nas/out/customer/error/`
- `p_archive_dir=/nas/archive/customer/`
- `p_mail_to=kyc-ops@bank.com`
- `p_mail_cc` (optional)
- `p_env=DEV/UAT/PROD`
In Sequence, derive dynamic names:
- `good_file = CUST_KYC_CLEAN_${RUN_TS}.txt`
- `err_file = CUST_KYC_REJECT_${RUN_TS}.txt`
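A quick sketch of the name derivation. The `date` format is illustrative; in the Sequence you would typically compute this with an Execute Command activity or a User Variables activity:

```shell
#!/bin/sh
# Sketch: derive the run timestamp and the dynamic output file names.
RUN_TS=$(date +%Y%m%d_%H%M%S)
good_file="CUST_KYC_CLEAN_${RUN_TS}.txt"
err_file="CUST_KYC_REJECT_${RUN_TS}.txt"
echo "$good_file"
echo "$err_file"
```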
B) Empty file handling (this is where many fail interviews)
In Sequence, do it before running the parallel job:
Execute Command stage commands:
- Existence check: `test -f "${p_in_dir}/${p_in_filename}"`
- Non-empty check: `test -s "${p_in_dir}/${p_in_filename}"`
If empty:
- Send email: "Empty file received"
- Optionally write a 1-line error file like: `EMPTY_FILE|<filename>|<timestamp>`
- Archive the empty file (still archive it, to avoid reprocessing)
This is exactly the “real world” behavior BFSI teams expect.
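The two checks can be combined into one guard script; here is a minimal sketch (the status words, marker format, and sandbox paths are illustrative, and an empty upload is simulated so the sketch is runnable anywhere):

```shell
#!/bin/sh
# Sketch: pre-ETL file guard. Emits a status word the Sequence can branch on.
base=$(mktemp -d)
in_file="$base/CUSTOMER_KYC_UPDATES_20260128.txt"
: > "$in_file"                      # simulate an empty upload

if [ ! -f "$in_file" ]; then
    status="MISSING"
elif [ ! -s "$in_file" ]; then
    status="EMPTY"
    # optional 1-line error marker, as described above
    echo "EMPTY_FILE|$(basename "$in_file")|$(date +%Y%m%d%H%M%S)" > "$base/marker.txt"
else
    status="OK"
fi
echo "$status"                      # prints EMPTY for the simulated upload
```

In the Sequence, the Execute Command activity's exit status (or this status word) drives the branch to the email path before the parallel job ever runs.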
C) Record-level validation and “reject reason”
In your Transformer, create a column like `reject_reason` (string).
Example rules (simple but solid):
- `customer_id` null or non-numeric
- PAN not matching `[A-Z]{5}[0-9]{4}[A-Z]`
- DOB invalid / in the future / age < 18
- mobile not 10 digits
- `risk_category` not in (`LOW`, `MED`, `HIGH`)
- pincode not 6 digits
Pattern in Transformer (recommended):
- Create boolean flags per rule: `is_pan_valid`, `is_dob_valid`, etc.
- Create `is_record_valid = is_pan_valid AND is_dob_valid AND ...`
- Build `reject_reason` by concatenating failures, e.g. `"PAN_INVALID|DOB_INVALID|MOBILE_INVALID"`
Output link constraints:
- `VALID_OUT` constraint: `is_record_valid`
- `REJECT_OUT` constraint: `NOT(is_record_valid)`
This gives you record-level rejects without killing the whole job.
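Outside DataStage, the same rule pattern (per-field checks, a concatenated `reject_reason`, and two output streams) can be sketched with `awk`. This is only an analogy for the Transformer logic: the field order, delimiter, and reason codes below are illustrative assumptions, and only three of the rules are shown.

```shell
#!/bin/sh
# Sketch of the Transformer pattern: check each field, build reject_reason
# by concatenation, and split rows into valid/reject outputs.
# Assumed layout: customer_id|pan|dob|mobile|risk_category|pincode
base=$(mktemp -d); cd "$base" || exit 1
awk -F'|' '
{
    reason = ""
    # PAN pattern, spelled out (equivalent to [A-Z]{5}[0-9]{4}[A-Z])
    if ($2 !~ /^[A-Z][A-Z][A-Z][A-Z][A-Z][0-9][0-9][0-9][0-9][A-Z]$/)
        reason = reason "PAN_INVALID|"
    # mobile must be exactly 10 digits
    if ($4 !~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/)
        reason = reason "MOBILE_INVALID|"
    # risk_category must be one of the allowed values
    if ($5 != "LOW" && $5 != "MED" && $5 != "HIGH")
        reason = reason "RISK_INVALID|"
    if (reason == "")
        print $0 > "valid.out"
    else
        print $0 "|" reason > "reject.out"
}' <<'EOF'
1029384756|ABCDE1234F|1996-09-16|9876543210|LOW|560001
1029384757|BAD123|1996-09-16|98765|HIGH|560001
EOF
# first row is clean, second row fails PAN and mobile
```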
D) “If even one record fails → send email”
You still write the clean output file for valid records (as you described), but you must trigger an email if rejects exist.
How to get reject count back to Sequence (practical options):
Option 1 (most common): use job User Status
At the end of the Parallel job:
- add Aggregator/counter logic to compute `reject_count`
- in a final Transformer or an after-job routine, set:
  `DSSetUserStatus("REJECT_COUNT=" : String(reject_count) : ";TOTAL=" : String(total_count))`
Then in the Sequence:
- read the job's user status string and parse `reject_count`
- Decision: if `reject_count > 0` → send email
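On the Sequence side, the user status string can be parsed with standard shell tools. A minimal sketch, where the status format matches the `DSSetUserStatus` call above; the hard-coded string stands in for the value the Sequence would read from the job activity's `$UserStatus`:

```shell
#!/bin/sh
# Sketch: parse "REJECT_COUNT=20;TOTAL=1000" and decide on the email branch.
user_status="REJECT_COUNT=20;TOTAL=1000"   # illustrative; comes from $UserStatus
reject_count=$(echo "$user_status" | tr ';' '\n' | sed -n 's/^REJECT_COUNT=//p')

if [ "$reject_count" -gt 0 ]; then
    echo "SEND_EMAIL"
else
    echo "NO_EMAIL"
fi
```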
Option 2: write a small “control file”
Parallel job writes `/nas/out/customer/control/CUST_KYC_CTRL_${RUN_TS}.txt` with:
`TOTAL=1000`
`VALID=980`
`REJECT=20`
Sequence reads that file (or just greps it) and decides.
If you want to look “extra production-grade” in interview, pick Option 2 (it’s audit-friendly).
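Option 2's handshake can be sketched as a producer/consumer pair. The file location and key names are illustrative (a temp sandbox is used so the sketch is runnable); in the design above, the parallel job writes the file and the Sequence reads it:

```shell
#!/bin/sh
# Sketch: parallel job writes the control file, Sequence parses it and branches.
base=$(mktemp -d)
ctrl="$base/CUST_KYC_CTRL_20260128_120000.txt"

# -- parallel job side (conceptually): write the audit counts
{
    echo "TOTAL=1000"
    echo "VALID=980"
    echo "REJECT=20"
} > "$ctrl"

# -- sequence side: read the reject count and decide
reject=$(sed -n 's/^REJECT=//p' "$ctrl")
if [ "$reject" -gt 0 ]; then
    echo "SEND_EMAIL ($reject rejects)"
fi
```

Because the counts land in a plain file on NAS, they also double as an audit trail, which is why this option reads as "extra production-grade".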
E) Email content (what ops teams like)
Include:
- environment
- file name
- counts
- path to reject file
- sample reject reasons (top 5)
Example subject:
`[UAT][CUST_KYC] File validation failed: 20 rejects - CUSTOMER_KYC_UPDATES_20260128.txt`
F) Operational hardening (small things that impress)
- Archive the input file only after processing
- Use timestamped output files to avoid overwrites
- Write the reject file even if only 1 record fails
- Add a header/trailer check if your file has one (common in banks)
  - the trailer might include a record count; validate it
- Add retry logic in the Sequence (optional)
- Add standard log messages + counts
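The trailer check can be sketched as a count reconciliation. The layout is an assumption for illustration (a `HDR|...` header line and a `TRL|<count>` trailer; real bank layouts vary):

```shell
#!/bin/sh
# Sketch: verify the trailer's declared record count against the actual
# number of data records (TRL|<count> trailer format is assumed).
base=$(mktemp -d)
f="$base/input.txt"
printf 'HDR|CUSTOMER_KYC|20260128\n1|A\n2|B\n3|C\nTRL|3\n' > "$f"

declared=$(tail -n 1 "$f" | cut -d'|' -f2)
actual=$(($(wc -l < "$f") - 2))      # total lines minus header and trailer

if [ "$declared" -eq "$actual" ]; then
    echo "TRAILER_OK"
else
    echo "TRAILER_MISMATCH declared=$declared actual=$actual"
fi
```

A mismatch here should route to the same reject-email path as record-level failures, since it usually means a truncated file transfer.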
Professional workflow diagram (high-level)
You can paste these into any Mermaid renderer (or Confluence/GitHub) and they render cleanly. Two diagrams are shown after the talking points below: the Sequence orchestration flow and the parallel job's internal ETL flow.
What to tell your evaluator (the “I know DataStage” talking points)
- “I used a Sequence to orchestrate file checks, routing, email notifications, and archival.”
- “I handled the empty file case before running ETL to avoid false job success.”
- “In the parallel job, I split data into valid and reject flows with link constraints, and captured reject_reason for each failed row.”
- “I made it production-grade via parameterization, timestamped outputs, audit counts, and deterministic archiving.”
Sequence job (orchestrator) flow:

```mermaid
flowchart TB
    A([Start]) --> B[Set RUN_TS, build filenames, set params]
    B --> C{Input file exists?}
    C -- No --> E1[Send Email: File not found] --> Z([End])
    C -- Yes --> D{"File empty? (size = 0)"}
    D -- Yes --> E2["Write error marker (optional)"] --> E3[Send Email: Empty file received] --> H[Archive input file] --> Z
    D -- No --> F[Run Parallel Job: Validate + Transform]
    F --> G{"Reject count > 0?"}
    G -- Yes --> I[Reject file already created by job] --> J[Send Email: Validation failed + attach paths] --> H --> Z
    G -- No --> K[Good output file written] --> H --> Z
```
Parallel job (ETL) internal flow:

```mermaid
flowchart LR
    R[Sequential File - Read Input] --> T["Transformer - Validate & Transform"]
    T -->|VALID_OUT| W1[Sequential File - Write Good Output]
    T -->|REJECT_OUT| W2["Sequential File - Write Reject Output (with reason)"]
    T --> C[Row Count / Audit] --> S[Set UserStatus or Write Control File]
```

