DataStage Job Context


Ask: I was assigned to build a sample job in IBM DataStage. The job takes a text file from a NAS path and validates the data in it. If all validations pass, the data is transformed, written to a new text file, and placed in the same NAS path. If validation fails for even a single record, the failed records are written to a new error text file and an email is sent to the concerned team. This failure scenario can also occur when an empty text file is uploaded. Please guide me in detail on how to create this job in DataStage. The task was given to verify my knowledge of DataStage, so I want the job built like a real production-grade job, ideally using a BFSI scenario that doesn’t involve transactions. I would also like a high-level, visually appealing, professional workflow diagram for this job.

Below is a real production-style DataStage design that exactly matches your ask:

  • Read a text file from a NAS path
  • Validate every record
  • If all validations pass → transform + write a “good” output file back to NAS
  • If even 1 record fails → write failed records to an error file + send email
  • Special case: empty file uploaded → treat as failure + email

I’ll frame it as a BFSI, non-transactional scenario: Customer Master / KYC enrichment batch (no payments, no transactions).


BFSI scenario (non-transactional) you can present in the interview

Input file: CUSTOMER_KYC_UPDATES_YYYYMMDD.txt dropped to NAS by an upstream system.

Each row contains customer updates used for analytics / compliance reporting:

| Field | Example | Validation idea |
| --- | --- | --- |
| customer_id | 1029384756 | mandatory, numeric, length |
| pan | ABCDE1234F | mandatory, PAN pattern |
| dob | 1996-09-16 | valid date, not in future, age >= 18 |
| mobile | 9876543210 | 10 digits |
| email | a@b.com | contains @ + domain |
| risk_category | LOW/MED/HIGH | allowed values |
| pincode | 560001 | 6 digits |

Transformations (examples):

  • normalize mobile, uppercase PAN
  • derive age
  • mask PAN for output (ABCDE****F)
  • standardize risk_category

Production-grade architecture in DataStage

You’ll build one Sequence job (orchestrator) + one Parallel job (ETL).

1) Sequence job (Orchestrator)

Name example: SEQ_CUST_KYC_FILE_PROCESS

Purpose:

  • Detect file arrival + empty file
  • Run parallel ETL job
  • Based on result (fail count or empty file), send email
  • Archive/rename input file safely

Key activities inside the Sequence

  1. Start → Set Variables

    • Compute RUN_TS, RUN_ID, file names (good/error), archive name
  2. Check File Exists

    • Use Execute Command stage with UNIX: test -f <inputfile>
  3. Check Empty File

    • Use UNIX: test -s <inputfile>
    • If empty → go to Send Email + create an error marker file (optional)
  4. Run Parallel Job

    • JOB_CUST_KYC_VALIDATE_TRANSFORM
  5. Decision: any rejects?

    • Based on returned reject count / status
    • If rejects > 0 → send email (and keep reject file)
  6. Archive Input File

    • Move to /nas/archive/ with timestamp suffix
  7. End

Email sending options (production-friendly)

  • Execute Command stage using mailx or sendmail (most common on UNIX; example below)
  • If your org uses enterprise email: call a shell script that hits SMTP relay / API.
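
A minimal sketch of what that Execute Command activity might run (mailx flags and relay setup vary by platform; p_env, p_mail_to, and the other variables are the job parameters defined in section A below):

```sh
# Sketch: one-line alert via mailx (assumes a local MTA/SMTP relay is configured).
echo "Validation failed for ${p_in_filename} - rejects written to ${p_err_dir}" | \
  mailx -s "[${p_env}][CUST_KYC] Validation failed" "${p_mail_to}"
```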

2) Parallel job (ETL)

Name example: JOB_CUST_KYC_VALIDATE_TRANSFORM

Stages (typical)

  1. Sequential File (Read)

    • Reads input from NAS
    • Define schema (fixed-width or delimited)
  2. Transformer (Validate + Transform)

    • Outputs to two links:

      • VALID_OUT
      • REJECT_OUT
  3. Sequential File (Write Good Output)

    • Writes transformed valid rows to output file on NAS
  4. Sequential File (Write Error Output)

    • Writes rejected rows with rejection reason to error file
  5. Row counts / audit

    • Capture total/valid/reject counts (for sequence + logs)

How to implement the key requirements (step-by-step)

A) Parameterize everything (a must for “production-grade”)

Create a Parameter Set (recommended) or job parameters:

  • p_in_dir = /nas/in/customer/
  • p_in_filename = #CUSTOMER_KYC_UPDATES# (or full name)
  • p_out_dir = /nas/out/customer/
  • p_err_dir = /nas/out/customer/error/
  • p_archive_dir = /nas/archive/customer/
  • p_mail_to = kyc-ops@bank.com
  • p_mail_cc (optional)
  • p_env = DEV/UAT/PROD

In the Sequence, derive dynamic names (see the sketch below):

  • good_file = CUST_KYC_CLEAN_${RUN_TS}.txt
  • err_file = CUST_KYC_REJECT_${RUN_TS}.txt
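
A shell sketch of those derivations (in a real Sequence you would typically do this in a User Variables activity with BASIC expressions; the variable names below simply mirror the parameters above):

```sh
# Sketch: derive run timestamp and dynamic file names from the parameters above.
RUN_TS=$(date +%Y%m%d_%H%M%S)
RUN_ID="CUST_KYC_${RUN_TS}"
good_file="CUST_KYC_CLEAN_${RUN_TS}.txt"
err_file="CUST_KYC_REJECT_${RUN_TS}.txt"
archive_file="${p_in_filename%.txt}_${RUN_TS}.txt"   # timestamped name for archival
```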

B) Empty file handling (this is where many candidates slip up in interviews)

In Sequence, do it before running the parallel job:

Execute Command stage command:

  • Existence check: test -f "${p_in_dir}/${p_in_filename}"
  • Non-empty check: test -s "${p_in_dir}/${p_in_filename}"

If empty:

  • Send email: “Empty file received”
  • Optionally write a 1-line error file like:

    • EMPTY_FILE|<filename>|<timestamp>
  • Archive the empty file (still archive it, to avoid reprocessing)

This is exactly the “real world” behavior BFSI teams expect.
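
Both checks can live in one small script behind a single Execute Command activity; a sketch (the exit codes are illustrative and would drive the sequence’s trigger expressions, typically via the activity’s $ReturnValue):

```sh
#!/bin/sh
# Pre-check for the sequence: 0 = OK, 1 = file missing, 2 = file empty.
f="${p_in_dir}/${p_in_filename}"
[ -f "$f" ] || { echo "MISSING_FILE|${p_in_filename}|$(date +%Y%m%d%H%M%S)"; exit 1; }
[ -s "$f" ] || { echo "EMPTY_FILE|${p_in_filename}|$(date +%Y%m%d%H%M%S)"; exit 2; }
exit 0
```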


C) Record-level validation and “reject reason”

In your Transformer, create a column like:

  • reject_reason (string)

Example rules (simple but solid):

  • customer_id null or non-numeric
  • PAN not matching [A-Z]{5}[0-9]{4}[A-Z]
  • DOB invalid / future / age < 18
  • mobile not 10 digits
  • risk_category not in (LOW,MED,HIGH)
  • pincode not 6 digits

Pattern in Transformer (recommended):

  1. Create boolean flags per rule:

    • is_pan_valid
    • is_dob_valid
    • etc.
  2. Create is_record_valid = is_pan_valid AND is_dob_valid AND ...
  3. Build reject_reason by concatenating failures:

    • e.g., "PAN_INVALID|DOB_INVALID|MOBILE_INVALID"

Output link constraints:

  • VALID_OUT constraint: is_record_valid
  • REJECT_OUT constraint: NOT(is_record_valid)

This gives you record-level rejects without killing the whole job.
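
To sanity-check the rules before building the Transformer, you can prototype the same split in awk (a sketch assuming a pipe-delimited file with the seven columns from the table above; the DOB rule is omitted since date math is easier inside DataStage, and the input file name is a placeholder):

```sh
# Prototype of the VALID_OUT / REJECT_OUT split, with a reject_reason per row.
awk -F'|' '{
  reason = ""
  if ($1 !~ /^[0-9]+$/)                         reason = reason "CUSTID_INVALID|"
  if (toupper($2) !~ /^[A-Z]{5}[0-9]{4}[A-Z]$/) reason = reason "PAN_INVALID|"
  if ($4 !~ /^[0-9]{10}$/)                      reason = reason "MOBILE_INVALID|"
  if ($6 !~ /^(LOW|MED|HIGH)$/)                 reason = reason "RISKCAT_INVALID|"
  if ($7 !~ /^[0-9]{6}$/)                       reason = reason "PINCODE_INVALID|"
  sub(/\|$/, "", reason)                        # drop trailing delimiter
  if (reason == "") print > "clean.txt"
  else              print $0 "|" reason > "reject.txt"
}' CUSTOMER_KYC_UPDATES_20260128.txt
```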


D) “If even one record fails → send email”

You still write the clean output file for valid records (as you described), but you must trigger an email if rejects exist.

How to get reject count back to Sequence (practical options):

Option 1 (most common): use job User Status

At end of the Parallel job:

  • add Aggregator/counter logic to compute reject_count
  • in a final Transformer or a routine, set:

    • DSSetUserStatus("REJECT_COUNT=" : String(reject_count) : ";TOTAL=" : String(total_count))

Then in Sequence:

  • read the job user status string and parse reject_count
  • Decision stage: if reject_count > 0 → send email
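
In the Decision/Nested Condition trigger, BASIC’s Field() function can parse that string; assuming the Job Activity is named jaETL (an illustrative name), an expression like Field(Field(jaETL.$UserStatus, ";", 1), "=", 2) > 0 would detect rejects.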

Option 2: write a small “control file”

Parallel job writes:

  • /nas/out/customer/control/CUST_KYC_CTRL_${RUN_TS}.txt with:
  • TOTAL=1000
  • VALID=980
  • REJECT=20

Sequence reads that file (or just greps it) and decides.
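
A sketch of that sequence-side read via an Execute Command activity (the path mirrors the example above):

```sh
# Pull REJECT=<n> out of the control file; the sequence branches on this output.
ctl="/nas/out/customer/control/CUST_KYC_CTRL_${RUN_TS}.txt"
REJECT_COUNT=$(grep '^REJECT=' "$ctl" | cut -d'=' -f2)
echo "${REJECT_COUNT:-0}"
```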

If you want to look “extra production-grade” in the interview, pick Option 2 (it’s audit-friendly).


E) Email content (what ops teams like)

Include:

  • environment
  • file name
  • counts
  • path to reject file
  • sample reject reasons (top 5)

Example subject:

  • [UAT][CUST_KYC] File validation failed: 20 rejects - CUSTOMER_KYC_UPDATES_20260128.txt
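
Putting it together, a sketch of the failure email (this assumes reject_reason is the last pipe-delimited field of the reject file, as built in section C, and that TOTAL_COUNT/VALID_COUNT/REJECT_COUNT come from the control file or user status):

```sh
# Compose the ops email: counts, paths, and the top 5 reject reasons.
top5=$(awk -F'|' '{print $NF}' "${p_err_dir}/${err_file}" | sort | uniq -c | sort -rn | head -5)
mailx -s "[${p_env}][CUST_KYC] File validation failed: ${REJECT_COUNT} rejects - ${p_in_filename}" \
  "${p_mail_to}" <<EOF
Environment : ${p_env}
File        : ${p_in_filename}
Counts      : TOTAL=${TOTAL_COUNT} VALID=${VALID_COUNT} REJECT=${REJECT_COUNT}
Reject file : ${p_err_dir}/${err_file}
Top reject reasons:
${top5}
EOF
```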

F) Operational hardening (small things that impress)

  • Archive the input file only after processing
  • Use timestamped output files to avoid overwrites
  • Write the reject file even if only one record fails
  • Add a header/trailer check if your file has one (common in banks)

    • the trailer might include a record count; validate it (see the sketch after this list)
  • Add job retry logic in the sequence (optional)
  • Add standard log messages + counts
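
For that header/trailer check, a sketch (assumes one header line and a trailer like TRL|<record_count> as the last line; the layout is an assumption):

```sh
# Validate the trailer: declared record count must match actual data rows.
f="${p_in_dir}/${p_in_filename}"
declared=$(tail -1 "$f" | cut -d'|' -f2)    # e.g. trailer line "TRL|998"
actual=$(( $(wc -l < "$f") - 2 ))           # total lines minus header and trailer
if [ "$declared" -ne "$actual" ]; then
  echo "TRAILER_MISMATCH|declared=${declared}|actual=${actual}"
  exit 3
fi
```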

Professional workflow diagram (high-level)

You can paste this into any Mermaid renderer (or even Confluence/GitHub) and it looks clean.

```mermaid
flowchart TB
    A([Start]) --> B[Set RUN_TS, build filenames, set params]
    B --> C{Input file exists?}
    C -- No --> E1[Send Email: File not found] --> Z([End])
    C -- Yes --> D{"File empty? (size = 0)"}
    D -- Yes --> E2["Write error marker (optional)"] --> E3[Send Email: Empty file received] --> H[Archive input file] --> Z
    D -- No --> F[Run Parallel Job: Validate + Transform]
    F --> G{"Reject count > 0?"}
    G -- Yes --> I[Reject file already created by job] --> J[Send Email: Validation failed + attach paths] --> H
    G -- No --> K[Good output file written] --> H
```

Parallel job internal flow diagram (what the ETL job looks like)

```mermaid
flowchart LR
    R[Sequential File - Read Input] --> T["Transformer - Validate & Transform"]
    T -->|VALID_OUT| W1[Sequential File - Write Good Output]
    T -->|REJECT_OUT| W2["Sequential File - Write Reject Output (with reason)"]
    T --> C[Row Count / Audit] --> S[Set UserStatus or Write Control File]
```

What to tell your evaluator (the “I know DataStage” talking points)

  • “I used a Sequence to orchestrate file checks, routing, email notifications, and archival.”
  • “I handled the empty file case before running ETL to avoid false job success.”
  • “In the parallel job, I split data into valid and reject flows with link constraints, and captured reject_reason for each failed row.”
  • “I made it production-grade via parameterization, timestamped outputs, audit counts, and deterministic archiving.”
