DataStage Job Context


Ask: I was assigned to build a sample job in IBM DataStage. The job takes a text file from a NAS path and validates the data in it. If all validations pass, the data is transformed, written to a new text file, and placed in the same NAS path. If validation fails for even a single record, the failed records are written to a new error text file and an email is sent to the concerned team. This failure scenario can also occur when an empty text file is uploaded. Please guide me in detail on how to create this job in DataStage. The task was given to verify my knowledge of DataStage, so I want the job built like a real production-grade job, ideally using a BFSI scenario that doesn’t involve transactions. I would also like a high-level, visually appealing, professional workflow diagram for this job.

Below is a real production-style DataStage design that exactly matches your ask:

  • Read a text file from a NAS path
  • Validate every record
  • If all validations pass → transform + write a “good” output file back to NAS
  • If even 1 record fails → write failed records to an error file + send email
  • Special case: empty file uploaded → treat as failure + email

I’ll frame it as a BFSI, non-transactional scenario: Customer Master / KYC enrichment batch (no payments, no transactions).


BFSI scenario (non-transactional) you can present in the interview

Input file: CUSTOMER_KYC_UPDATES_YYYYMMDD.txt dropped to NAS by an upstream system.

Each row contains customer updates used for analytics / compliance reporting:

| Field | Example | Validation idea |
| --- | --- | --- |
| customer_id | 1029384756 | mandatory, numeric, length |
| pan | ABCDE1234F | mandatory, PAN pattern |
| dob | 1996-09-16 | valid date, not in future, age >= 18 |
| mobile | 9876543210 | 10 digits |
| email | a@b.com | contains @ + domain |
| risk_category | LOW/MED/HIGH | allowed values |
| pincode | 560001 | 6 digits |

Transformations (examples):

  • normalize mobile, uppercase PAN
  • derive age
  • mask PAN for output (ABCDE****F)
  • standardize risk_category

Production-grade architecture in DataStage

You’ll build one Sequence job (orchestrator) + one Parallel job (ETL).

1) Sequence job (Orchestrator)

Name example: SEQ_CUST_KYC_FILE_PROCESS

Purpose:

  • Detect file arrival + empty file
  • Run parallel ETL job
  • Based on result (fail count or empty file), send email
  • Archive/rename input file safely

Key activities inside the Sequence

  1. Start → Set Variables

    • Compute RUN_TS, RUN_ID, file names (good/error), archive name
  2. Check File Exists

    • Use Execute Command stage with UNIX: test -f <inputfile>
  3. Check Empty File

    • Use UNIX: test -s <inputfile>
    • If empty → go to Send Email + create an error marker file (optional)
  4. Run Parallel Job

    • JOB_CUST_KYC_VALIDATE_TRANSFORM
  5. Decision: any rejects?

    • Based on returned reject count / status
    • If rejects > 0 → send email (and keep reject file)
  6. Archive Input File

    • Move to /nas/archive/ with timestamp suffix
  7. End

Email sending options (production-friendly)

  • Execute Command stage using mailx or sendmail (most common on UNIX; example below)
  • If your org uses enterprise email: call a shell script that hits SMTP relay / API.
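
A minimal sketch of what that Execute Command activity might run (mailx flags and relay setup vary by platform; p_env, p_mail_to, and the other variables are the job parameters defined in section A below):

```sh
# Sketch: one-line alert via mailx (assumes a local MTA/SMTP relay is configured).
echo "Validation failed for ${p_in_filename} - rejects written to ${p_err_dir}" | \
  mailx -s "[${p_env}][CUST_KYC] Validation failed" "${p_mail_to}"
```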

2) Parallel job (ETL)

Name example: JOB_CUST_KYC_VALIDATE_TRANSFORM

Stages (typical)

  1. Sequential File (Read)

    • Reads input from NAS
    • Define schema (fixed-width or delimited)
  2. Transformer (Validate + Transform)

    • Outputs to two links:

      • VALID_OUT
      • REJECT_OUT
  3. Sequential File (Write Good Output)

    • Writes transformed valid rows to output file on NAS
  4. Sequential File (Write Error Output)

    • Writes rejected rows with rejection reason to error file
  5. Row counts / audit

    • Capture total/valid/reject counts (for sequence + logs)

How to implement the key requirements (step-by-step)

A) Parameterize everything (a must for “production-grade”)

Create a Parameter Set (recommended) or job parameters:

  • p_in_dir = /nas/in/customer/
  • p_in_filename = #CUSTOMER_KYC_UPDATES# (or full name)
  • p_out_dir = /nas/out/customer/
  • p_err_dir = /nas/out/customer/error/
  • p_archive_dir = /nas/archive/customer/
  • p_mail_to = kyc-ops@bank.com
  • p_mail_cc (optional)
  • p_env = DEV/UAT/PROD

In the Sequence, derive dynamic names (see the sketch below):

  • good_file = CUST_KYC_CLEAN_${RUN_TS}.txt
  • err_file = CUST_KYC_REJECT_${RUN_TS}.txt
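
A shell sketch of those derivations (in a real Sequence you would typically do this in a User Variables activity with BASIC expressions; the variable names below simply mirror the parameters above):

```sh
# Sketch: derive run timestamp and dynamic file names from the parameters above.
RUN_TS=$(date +%Y%m%d_%H%M%S)
RUN_ID="CUST_KYC_${RUN_TS}"
good_file="CUST_KYC_CLEAN_${RUN_TS}.txt"
err_file="CUST_KYC_REJECT_${RUN_TS}.txt"
archive_file="${p_in_filename%.txt}_${RUN_TS}.txt"   # timestamped name for archival
```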

B) Empty file handling (this is where many candidates slip up in interviews)

In Sequence, do it before running the parallel job:

Execute Command stage command:

  • Existence check: test -f "${p_in_dir}/${p_in_filename}"
  • Non-empty check: test -s "${p_in_dir}/${p_in_filename}"

If empty:

  • Send email: “Empty file received”
  • Optionally write a 1-line error file like:

    • EMPTY_FILE|<filename>|<timestamp>
  • Archive the empty file (still archive it, to avoid reprocessing)

This is exactly the “real world” behavior BFSI teams expect.
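
Both checks can live in one small script behind a single Execute Command activity; a sketch (the exit codes are illustrative and would drive the sequence’s trigger expressions, typically via the activity’s $ReturnValue):

```sh
#!/bin/sh
# Pre-check for the sequence: 0 = OK, 1 = file missing, 2 = file empty.
f="${p_in_dir}/${p_in_filename}"
[ -f "$f" ] || { echo "MISSING_FILE|${p_in_filename}|$(date +%Y%m%d%H%M%S)"; exit 1; }
[ -s "$f" ] || { echo "EMPTY_FILE|${p_in_filename}|$(date +%Y%m%d%H%M%S)"; exit 2; }
exit 0
```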


C) Record-level validation and “reject reason”

In your Transformer, create a column like:

  • reject_reason (string)

Example rules (simple but solid):

  • customer_id null or non-numeric
  • PAN not matching [A-Z]{5}[0-9]{4}[A-Z]
  • DOB invalid / future / age < 18
  • mobile not 10 digits
  • risk_category not in (LOW,MED,HIGH)
  • pincode not 6 digits

Pattern in Transformer (recommended):

  1. Create boolean flags per rule:

    • is_pan_valid
    • is_dob_valid
    • etc.
  2. Create is_record_valid = is_pan_valid AND is_dob_valid AND ...
  3. Build reject_reason by concatenating failures:

    • e.g., "PAN_INVALID|DOB_INVALID|MOBILE_INVALID"

Output link constraints:

  • VALID_OUT constraint: is_record_valid
  • REJECT_OUT constraint: NOT(is_record_valid)

This gives you record-level rejects without killing the whole job.
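
To sanity-check the rules before building the Transformer, you can prototype the same split in awk (a sketch assuming a pipe-delimited file with the seven columns from the table above; the DOB rule is omitted since date math is easier inside DataStage, and the input file name is a placeholder):

```sh
# Prototype of the VALID_OUT / REJECT_OUT split, with a reject_reason per row.
awk -F'|' '{
  reason = ""
  if ($1 !~ /^[0-9]+$/)                         reason = reason "CUSTID_INVALID|"
  if (toupper($2) !~ /^[A-Z]{5}[0-9]{4}[A-Z]$/) reason = reason "PAN_INVALID|"
  if ($4 !~ /^[0-9]{10}$/)                      reason = reason "MOBILE_INVALID|"
  if ($6 !~ /^(LOW|MED|HIGH)$/)                 reason = reason "RISKCAT_INVALID|"
  if ($7 !~ /^[0-9]{6}$/)                       reason = reason "PINCODE_INVALID|"
  sub(/\|$/, "", reason)                        # drop trailing delimiter
  if (reason == "") print > "clean.txt"
  else              print $0 "|" reason > "reject.txt"
}' CUSTOMER_KYC_UPDATES_20260128.txt
```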


D) “If even one record fails → send email”

You still write the clean output file for valid records (as you described), but you must trigger an email if rejects exist.

How to get reject count back to Sequence (practical options):

Option 1 (most common): use job User Status

At end of the Parallel job:

  • add Aggregator/counter logic to compute reject_count
  • in a final Transformer or a routine, set:

    • DSSetUserStatus("REJECT_COUNT=" : String(reject_count) : ";TOTAL=" : String(total_count))

Then in Sequence:

  • read the job user status string and parse reject_count
  • Decision stage: if reject_count > 0 → send email
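
In the Decision/Nested Condition trigger, BASIC’s Field() function can parse that string; assuming the Job Activity is named jaETL (an illustrative name), an expression like Field(Field(jaETL.$UserStatus, ";", 1), "=", 2) > 0 would detect rejects.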

Option 2: write a small “control file”

Parallel job writes:

  • /nas/out/customer/control/CUST_KYC_CTRL_${RUN_TS}.txt with:
  • TOTAL=1000
  • VALID=980
  • REJECT=20

Sequence reads that file (or just greps it) and decides.
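
A sketch of that sequence-side read via an Execute Command activity (the path mirrors the example above):

```sh
# Pull REJECT=<n> out of the control file; the sequence branches on this output.
ctl="/nas/out/customer/control/CUST_KYC_CTRL_${RUN_TS}.txt"
REJECT_COUNT=$(grep '^REJECT=' "$ctl" | cut -d'=' -f2)
echo "${REJECT_COUNT:-0}"
```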

If you want to look “extra production-grade” in the interview, pick Option 2 (it’s audit-friendly).


E) Email content (what ops teams like)

Include:

  • environment
  • file name
  • counts
  • path to reject file
  • sample reject reasons (top 5)

Example subject:

  • [UAT][CUST_KYC] File validation failed: 20 rejects - CUSTOMER_KYC_UPDATES_20260128.txt
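
Putting it together, a sketch of the failure email (this assumes reject_reason is the last pipe-delimited field of the reject file, as built in section C, and that TOTAL_COUNT/VALID_COUNT/REJECT_COUNT come from the control file or user status):

```sh
# Compose the ops email: counts, paths, and the top 5 reject reasons.
top5=$(awk -F'|' '{print $NF}' "${p_err_dir}/${err_file}" | sort | uniq -c | sort -rn | head -5)
mailx -s "[${p_env}][CUST_KYC] File validation failed: ${REJECT_COUNT} rejects - ${p_in_filename}" \
  "${p_mail_to}" <<EOF
Environment : ${p_env}
File        : ${p_in_filename}
Counts      : TOTAL=${TOTAL_COUNT} VALID=${VALID_COUNT} REJECT=${REJECT_COUNT}
Reject file : ${p_err_dir}/${err_file}
Top reject reasons:
${top5}
EOF
```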

F) Operational hardening (small things that impress)

  • Archive the input file only after processing
  • Use timestamped output files to avoid overwrites
  • Write the reject file even if only one record fails
  • Add a header/trailer check if your file has one (common in banks)

    • the trailer might include a record count; validate it (see the sketch after this list)
  • Add job retry logic in the sequence (optional)
  • Add standard log messages + counts
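
For that header/trailer check, a sketch (assumes one header line and a trailer like TRL|<record_count> as the last line; the layout is an assumption):

```sh
# Validate the trailer: declared record count must match actual data rows.
f="${p_in_dir}/${p_in_filename}"
declared=$(tail -1 "$f" | cut -d'|' -f2)    # e.g. trailer line "TRL|998"
actual=$(( $(wc -l < "$f") - 2 ))           # total lines minus header and trailer
if [ "$declared" -ne "$actual" ]; then
  echo "TRAILER_MISMATCH|declared=${declared}|actual=${actual}"
  exit 3
fi
```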

Professional workflow diagram (high-level)

You can paste this into any Mermaid renderer (or even Confluence/GitHub) and it looks clean.

```mermaid
flowchart TB
    A([Start]) --> B[Set RUN_TS, build filenames, set params]
    B --> C{Input file exists?}
    C -- No --> E1[Send Email: File not found] --> Z([End])
    C -- Yes --> D{"File empty? (size = 0)"}
    D -- Yes --> E2["Write error marker (optional)"] --> E3[Send Email: Empty file received] --> H[Archive input file] --> Z
    D -- No --> F[Run Parallel Job: Validate + Transform]
    F --> G{"Reject count > 0?"}
    G -- Yes --> I[Reject file already created by job] --> J[Send Email: Validation failed + attach paths] --> H
    G -- No --> K[Good output file written] --> H
```

Parallel job internal flow diagram (what the ETL job looks like)

```mermaid
flowchart LR
    R[Sequential File - Read Input] --> T["Transformer - Validate & Transform"]
    T -->|VALID_OUT| W1[Sequential File - Write Good Output]
    T -->|REJECT_OUT| W2["Sequential File - Write Reject Output (with reason)"]
    T --> C[Row Count / Audit] --> S[Set UserStatus or Write Control File]
```

What to tell your evaluator (the “I know DataStage” talking points)

  • “I used a Sequence to orchestrate file checks, routing, email notifications, and archival.”
  • “I handled the empty file case before running ETL to avoid false job success.”
  • “In the parallel job, I split data into valid and reject flows with link constraints, and captured reject_reason for each failed row.”
  • “I made it production-grade via parameterization, timestamped outputs, audit counts, and deterministic archiving.”
