# DataStage Job Context Ask: I was assigned to build a sample job in IBM DataStage where it takes a text file from NAS path and validates the data in the text file and if the validations are passed, the data will get transformed, written in a new txt file and placed in the same NAS path and if the validations are failed for even a particular record in the text file, the failed records will get written in a new error text file and an email will be sent to the concerned team. This failure scenario can happen when an empty text file is uploaded. Please guide me in detail about how to create this job in DataStage. This task is given to verify my knowledge on DataStage. I want this job to be created like a real-time production-grade and if possible a BFSI scenario which doesn’t involve transactions. Also I want a high-level visual appealing professional workflow diagram developed for this job Below is a **real production-style DataStage design** that exactly matches your ask: * Read a **text file from a NAS path** * **Validate every record** * If **all validations pass** → transform + write a “good” output file back to NAS * If **even 1 record fails** → write failed records to an **error file** + **send email** * Special case: **empty file uploaded** → treat as failure + email I’ll frame it as a **BFSI non-transaction scenario**: *Customer Master / KYC enrichment batch* (no payments, no transactions). --- ## BFSI scenario (non-transaction) you can say in interview **Input file:** `CUSTOMER_KYC_UPDATES_YYYYMMDD.txt` dropped to NAS by an upstream system. Each row contains customer updates used for analytics / compliance reporting: | Field | Example | Validation idea | | ------------- | ------------------------- | --------------------------------- | | customer_id | 1029384756 | mandatory, numeric, length | | pan | ABCDE1234F | mandatory, PAN pattern | | dob | 1996-09-16 | valid date, not future, age >= 18 | | mobile | 9876543210 | 10 digits | | email | [a@b.com](mailto:a@b.com) | contains `@` + domain | | risk_category | LOW/MED/HIGH | allowed values | | pincode | 560001 | 6 digits | **Transformations (examples):** * normalize mobile, uppercase PAN * derive `age` * mask PAN for output (`ABCDE****F`) * standardize risk_category --- ## Production-grade architecture in DataStage You’ll build **one Sequence job** (orchestrator) + **one Parallel job** (ETL). ### 1) Sequence job (Orchestrator) Name example: `SEQ_CUST_KYC_FILE_PROCESS` **Purpose:** * Detect file arrival + empty file * Run parallel ETL job * Based on result (fail count or empty file), send email * Archive/rename input file safely **Key activities inside the Sequence** 1. **Start → Set Variables** * Compute `RUN_TS`, `RUN_ID`, file names (good/error), archive name 2. **Check File Exists** * Use *Execute Command stage* with UNIX: `test -f ` 3. **Check Empty File** * Use UNIX: `test -s ` * If empty → go to *Send Email* + create an error marker file (optional) 4. **Run Parallel Job** * `JOB_CUST_KYC_VALIDATE_TRANSFORM` 5. **Decision: any rejects?** * Based on returned reject count / status * If rejects > 0 → send email (and keep reject file) 6. **Archive Input File** * Move to `/nas/archive/` with timestamp suffix 7. **End** ### Email sending options (production-friendly) * **Execute Command stage** using `mailx` or `sendmail` (most common on UNIX) * If your org uses enterprise email: call a **shell script** that hits SMTP relay / API. --- ## 2) Parallel job (ETL) Name example: `JOB_CUST_KYC_VALIDATE_TRANSFORM` ### Stages (typical) 1. **Sequential File (Read)** * Reads input from NAS * Define schema (fixed-width or delimited) 2. **Transformer (Validate + Transform)** * Outputs to two links: * `VALID_OUT` * `REJECT_OUT` 3. **Sequential File (Write Good Output)** * Writes transformed valid rows to output file on NAS 4. **Sequential File (Write Error Output)** * Writes rejected rows with rejection reason to error file 5. **Row counts / audit** * Capture total/valid/reject counts (for sequence + logs) --- # How to implement the key requirements (step-by-step) ## A) Parameterize everything (must for “production-grade”) Create a **Parameter Set** (recommended) or job parameters: * `p_in_dir` = `/nas/in/customer/` * `p_in_filename` = `#CUSTOMER_KYC_UPDATES#` (or full name) * `p_out_dir` = `/nas/out/customer/` * `p_err_dir` = `/nas/out/customer/error/` * `p_archive_dir` = `/nas/archive/customer/` * `p_mail_to` = `kyc-ops@bank.com` * `p_mail_cc` (optional) * `p_env` = `DEV/UAT/PROD` In Sequence, derive dynamic names: * `good_file = CUST_KYC_CLEAN_${RUN_TS}.txt` * `err_file = CUST_KYC_REJECT_${RUN_TS}.txt` --- ## B) Empty file handling (this is where many fail interviews) In **Sequence**, do it *before* running the parallel job: **Execute Command stage command:** * Existence check: `test -f "${p_in_dir}/${p_in_filename}"` * Non-empty check: `test -s "${p_in_dir}/${p_in_filename}"` **If empty:** * Send email: “Empty file received” * Optionally write a 1-line error file like: * `EMPTY_FILE||` * Archive the empty file (still archive it, to avoid reprocessing) This is exactly the “real world” behavior BFSI teams expect. --- ## C) Record-level validation and “reject reason” In your **Transformer**, create a column like: * `reject_reason` (string) Example rules (simple but solid): * `customer_id` null or non-numeric * PAN not matching `[A-Z]{5}[0-9]{4}[A-Z]` * DOB invalid / future / age < 18 * mobile not 10 digits * risk_category not in (`LOW`,`MED`,`HIGH`) * pincode not 6 digits **Pattern in Transformer (recommended):** 1. Create boolean flags per rule: * `is_pan_valid` * `is_dob_valid` * etc. 2. Create `is_record_valid = is_pan_valid AND is_dob_valid AND ...` 3. Build `reject_reason` by concatenating failures: * e.g., `"PAN_INVALID|DOB_INVALID|MOBILE_INVALID"` **Output link constraints:** * `VALID_OUT` constraint: `is_record_valid` * `REJECT_OUT` constraint: `NOT(is_record_valid)` This gives you **record-level rejects** without killing the whole job. --- ## D) “If even one record fails → send email” You still write the clean output file for valid records (as you described), **but** you must trigger an email if rejects exist. How to get reject count back to Sequence (practical options): ### Option 1 (most common): use job **User Status** At end of the Parallel job: * add an Aggregator/Counter logic to compute reject_count * in a final Transformer or a routine, set: * `DSSetUserStatus("REJECT_COUNT=" : String(reject_count) : ";TOTAL=" : String(total_count))` Then in Sequence: * read the job user status string and parse reject_count * Decision stage: if `reject_count > 0` → send email ### Option 2: write a small “control file” Parallel job writes: * `/nas/out/customer/control/CUST_KYC_CTRL_${RUN_TS}.txt` with: * `TOTAL=1000` * `VALID=980` * `REJECT=20` Sequence reads that file (or just greps it) and decides. If you want to look “extra production-grade” in interview, pick **Option 2** (it’s audit-friendly). --- ## E) Email content (what ops teams like) Include: * environment * file name * counts * path to reject file * sample reject reasons (top 5) Example subject: * `[UAT][CUST_KYC] File validation failed: 20 rejects - CUSTOMER_KYC_UPDATES_20260128.txt` --- ## F) Operational hardening (small things that impress) * **Archive input file** only after processing * Use **timestamped output files** to avoid overwrite * Write **reject file even if 1 record fails** * Add a **header/trailer check** if your file has it (common in banks) * trailer might include record count; validate it * Add a **job retry** logic in sequence (optional) * Add **standard log messages** + counts --- # Professional workflow diagram (high-level) You can paste this into any Mermaid renderer (or even Confluence/GitHub) and it looks clean. ```mermaid flowchart TB A([Start]) --> B[Set RUN_TS, build filenames, set params] B --> C{Input file exists?} C -- No --> E1[Send Email: File not found] --> Z([End]) C -- Yes --> D{File empty? (size=0)} D -- Yes --> E2[Write error marker (optional)] --> E3[Send Email: Empty file received] --> H[Archive input file] --> Z D -- No --> F[Run Parallel Job: Validate + Transform] F --> G{Reject count > 0?} G -- Yes --> I[Write reject file already created by job] --> J[Send Email: Validation failed + attach paths] --> H --> Z G -- No --> K[Good output file written] --> H --> Z ``` --- # Parallel job internal flow diagram (what the ETL job looks like) ```mermaid flowchart LR R[Sequential File - Read Input] --> T[Transformer - Validate & Transform] T -->|VALID_OUT| W1[Sequential File - Write Good Output] T -->|REJECT_OUT| W2[Sequential File - Write Reject Output (with reason)] T --> C[Row Count/Audit] --> S[Set UserStatus or Write Control File] ``` --- ## What to tell your evaluator (the “I know DataStage” talking points) * “I used a **Sequence** to orchestrate file checks, routing, email notifications, and archival.” * “I handled the **empty file** case *before* running ETL to avoid false job success.” * “In the parallel job, I split data into **valid** and **reject** flows with **link constraints**, and captured **reject_reason** for each failed row.” * “I made it production-grade via **parameterization**, **timestamped outputs**, **audit counts**, and **deterministic archiving**.” --- flowchart TB A([Start]) --> B[Set RUN_TS, build filenames, set params] B --> C{Input file exists?} C -- No --> E1[Send Email: File not found] --> Z([End]) C -- Yes --> D{File empty? (size=0)} D -- Yes --> E2[Write error marker (optional)] --> E3[Send Email: Empty file received] --> H[Archive input file] --> Z D -- No --> F[Run Parallel Job: Validate + Transform] F --> G{Reject count > 0?} G -- Yes --> I[Write reject file already created by job] --> J[Send Email: Validation failed + attach paths] --> H --> Z G -- No --> K[Good output file written] --> H --> Z flowchart LR R[Sequential File - Read Input] --> T[Transformer - Validate & Transform] T -->|VALID_OUT| W1[Sequential File - Write Good Output] T -->|REJECT_OUT| W2[Sequential File - Write Reject Output (with reason)] T --> C[Row Count/Audit] --> S[Set UserStatus or Write Control File]