Job Scripts Introduction

Job scripts in Vantage are executable programs that define the computational work to be performed on cluster resources. They serve as the core component of job submission, containing the instructions, commands, and logic needed to complete specific tasks.

What are Job Scripts?

Job scripts are files that contain:

  • Execution Commands: The actual programs and commands to run
  • Resource Directives: Instructions for the job scheduler about resource requirements
  • Environment Setup: Configuration of software environments and dependencies
  • Data Handling: Input/output operations and file management
  • Error Handling: Logic for managing failures and recovery

Types of Job Scripts

Batch Scripts

Traditional batch processing scripts for non-interactive workloads (submission example below):

  • Sequential Processing: Jobs that run from start to finish without user interaction
  • Long-Running Tasks: Computations that may take hours or days to complete
  • Resource-Intensive: Jobs requiring significant CPU, memory, or storage resources
  • Scheduled Execution: Jobs that run at specific times or intervals
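
On a Slurm-based cluster, a batch script is submitted to the scheduler rather than run directly. A minimal sketch, assuming a script named my_analysis.sh:

# Submit the batch script; Slurm prints the assigned job ID
sbatch my_analysis.sh

# Defer execution to a specific start time for scheduled runs
sbatch --begin=02:00 my_analysis.sh

# Check the status of your queued and running jobs
squeue -u $USER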

Interactive Scripts

Scripts designed for interactive computing sessions (example below):

  • Real-Time Interaction: Jobs that require user input during execution
  • Development Workflows: Testing and debugging computational code
  • Data Exploration: Interactive analysis and visualization tasks
  • Rapid Prototyping: Quick iterations and experimentation
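
On Slurm-based clusters, interactive sessions are typically started with srun or salloc rather than submitted as batch scripts; the resource values below are illustrative:

# Start an interactive shell on a compute node
srun --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash

# Or create an allocation first, then run commands inside it
salloc --cpus-per-task=4 --mem=8G --time=01:00:00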

Workflow Scripts

Complex scripts that orchestrate multiple computational steps (dependency example below):

  • Multi-Stage Processing: Jobs with sequential or parallel processing stages
  • Conditional Logic: Dynamic execution paths based on intermediate results
  • Error Recovery: Automatic retry and failure handling mechanisms
  • Resource Optimization: Dynamic resource allocation based on workload
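
One common way to express such workflows on Slurm is to submit each stage as its own job and chain them with dependencies. A minimal sketch, assuming stage scripts named preprocess.sh, train.sh, and report.sh:

#!/bin/bash
# --parsable makes sbatch print just the job ID so it can be captured
prep_id=$(sbatch --parsable preprocess.sh)

# Each later stage starts only if the previous stage completed successfully
train_id=$(sbatch --parsable --dependency=afterok:${prep_id} train.sh)
sbatch --dependency=afterok:${train_id} report.sh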

Script Languages and Frameworks

Shell Scripts

Traditional bash and shell scripting:

#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00

module load python/3.9
source activate my_env

python analysis.py --input data.csv --output results.txt

Python Scripts

Python-based computational scripts:

#!/usr/bin/env python3
"""
Data processing job script
"""
import sys
import pandas as pd
from pathlib import Path

def main():
    input_file = sys.argv[1]
    output_file = sys.argv[2]

    # Process data
    df = pd.read_csv(input_file)
    results = df.groupby('category').sum()
    results.to_csv(output_file)

if __name__ == "__main__":
    main()

R Scripts

Statistical computing and analysis:

#!/usr/bin/env Rscript
# Statistical analysis job script

# Load required libraries
library(dplyr)
library(ggplot2)

# Read command line arguments
args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_dir <- args[2]

# Perform analysis
data <- read.csv(input_file)
summary_stats <- data %>%
  group_by(group) %>%
  summarise(mean_value = mean(value),
            sd_value = sd(value))

# Save results
write.csv(summary_stats, file.path(output_dir, "summary.csv"))

Container-Based Scripts

Scripts using containerized environments:

#!/bin/bash
#SBATCH --job-name=container_job
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# Run with Singularity container
singularity exec tensorflow.sif python train_model.py

# Run with Docker (if available)
docker run --rm -v $(pwd):/workspace tensorflow/tensorflow:latest \
python /workspace/train_model.py

Script Structure and Components

Header Section

Resource requirements and job metadata:

#!/bin/bash
#SBATCH --job-name=data_processing
#SBATCH --account=research_project
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=job_%j.out
#SBATCH --error=job_%j.err

Environment Setup

Software environment configuration:

# Load required modules
module purge
module load gcc/9.3.0
module load python/3.9.5
module load cuda/11.2

# Activate virtual environment
source /path/to/venv/bin/activate

# Set environment variables
export OMP_NUM_THREADS=16
export CUDA_VISIBLE_DEVICES=0,1

Main Execution

Core computational logic:

# Create working directory
WORK_DIR="/tmp/job_${SLURM_JOB_ID}"
mkdir -p $WORK_DIR
cd $WORK_DIR

# Copy input files
cp $SLURM_SUBMIT_DIR/input/* .

# Run main computation
python analysis.py --config config.yaml --output results/

# Copy results back
cp -r results/ $SLURM_SUBMIT_DIR/

Error Handling

Robust error management:

# Cleanup function (defined before the traps that reference it)
cleanup_temp_files() {
    if [ -d "$WORK_DIR" ]; then
        rm -rf "$WORK_DIR"
    fi
}

# Function to handle errors
handle_error() {
    echo "Error occurred in script at line $1" >&2
    cleanup_temp_files
    exit 1
}

# Set error trap
trap 'handle_error $LINENO' ERR

# Ensure cleanup on exit
trap cleanup_temp_files EXIT

Best Practices

Resource Management

Optimize resource usage (monitoring example below):

  • Right-sizing: Request appropriate resources for your workload
  • Monitoring: Track actual resource usage vs. requested resources
  • Efficiency: Minimize idle time and resource waste
  • Scaling: Design scripts to scale with available resources
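
For the monitoring point, requested and actual usage can be compared after a job finishes with Slurm's accounting tools; replace <jobid> with a real job ID:

# Requested vs. consumed resources for a completed job
sacct -j <jobid> --format=JobID,JobName,ReqMem,MaxRSS,ReqCPUS,TotalCPU,Elapsed,State

# If the seff utility is installed, it summarizes CPU and memory efficiency
seff <jobid>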

Robust Error Handling

  • Input Validation: Check input files and parameters before processing (see the sketch below)
  • Graceful Failure: Handle errors without corrupting data or resources
  • Logging: Capture detailed information about script execution
  • Recovery: Implement mechanisms to resume from failures
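
A minimal sketch of input validation and logging at the top of a bash job script; the file and variable names are illustrative:

#!/bin/bash
set -euo pipefail   # abort on errors, unset variables, and failed pipes

INPUT_FILE="${1:-}"

# Validate inputs before doing any work
if [ -z "$INPUT_FILE" ] || [ ! -r "$INPUT_FILE" ]; then
    echo "ERROR: input file '$INPUT_FILE' is missing or unreadable" >&2
    exit 1
fi

# Log basic execution context for later debugging
echo "Job ${SLURM_JOB_ID:-local} started on $(hostname) at $(date)"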

Performance Optimization

Maximize computational efficiency (thread-sizing example below):

  • Parallelization: Use available CPU cores and GPUs effectively
  • I/O Optimization: Minimize file system operations and data movement
  • Memory Management: Optimize memory usage patterns and avoid leaks
  • Profiling: Identify and optimize performance bottlenecks
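
For the parallelization point, a common pattern is to derive thread counts from the scheduler's allocation instead of hard-coding them. A sketch assuming an OpenMP-capable program named simulate:

#!/bin/bash
#SBATCH --cpus-per-task=16

# Match the thread count to the allocated CPUs rather than hard-coding it
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# Launch through srun so the process is bound to the allocated cores
srun ./simulate --input params.in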

Reproducibility

Ensure consistent and reproducible results (example below):

  • Version Control: Track script versions and changes
  • Environment Specification: Document software dependencies and versions
  • Parameter Documentation: Clearly document input parameters and options
  • Output Validation: Implement checks to verify result correctness
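
A small sketch of recording the software environment alongside the results; the paths are illustrative and assume a Python virtual environment and a git checkout:

# Record the environment that produced the results
RUN_DIR="$SLURM_SUBMIT_DIR/results"
mkdir -p "$RUN_DIR"

module list 2> "$RUN_DIR/modules.txt"           # module systems print the list to stderr
pip freeze > "$RUN_DIR/requirements.txt"        # Python package versions
git rev-parse HEAD > "$RUN_DIR/git_commit.txt"  # exact script version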

Script Development Workflow

Development Environment

Set up an effective development environment:

  1. Local Testing: Develop and test scripts on local systems
  2. Version Control: Use git or similar for script management
  3. Documentation: Maintain clear documentation and comments
  4. Collaboration: Share scripts with team members for review

Testing Strategy

Implement comprehensive testing (smoke-test example below):

  1. Unit Testing: Test individual script components
  2. Integration Testing: Verify script works with cluster systems
  3. Performance Testing: Validate resource usage and efficiency
  4. Regression Testing: Ensure changes don't break existing functionality
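
One lightweight way to cover the integration and performance checks is a smoke-test submission with a small input and reduced limits before a full production run; the option values and file names below are illustrative:

# Smoke test: same script, small input, short time limit
sbatch --time=00:10:00 --mem=4G \
       --export=ALL,INPUT_FILE=data_sample.csv \
       my_analysis.sh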

Deployment Process

Deploy scripts to production:

  1. Staging: Test scripts in staging environment
  2. Validation: Verify script works with production data
  3. Monitoring: Track script performance and success rates
  4. Maintenance: Regular updates and optimization

Common Use Cases

Data Processing

Scripts for data transformation and analysis:

  • ETL Pipelines: Extract, transform, and load data workflows
  • Batch Processing: Large-scale data processing jobs
  • Stream Processing: Real-time data processing and analysis
  • Data Validation: Quality checks and data integrity verification

Machine Learning

Scripts for ML model development and training:

  • Model Training: Training neural networks and ML models
  • Hyperparameter Tuning: Automated parameter optimization
  • Model Evaluation: Performance testing and validation
  • Inference: Large-scale model prediction and scoring

Scientific Computing

Scripts for research and simulation:

  • Numerical Simulations: Physics, chemistry, and engineering simulations
  • Statistical Analysis: Statistical modeling and hypothesis testing
  • Optimization: Mathematical optimization and search algorithms
  • Visualization: Data visualization and result presentation

Getting Started

To begin working with job scripts:

  1. Understand Requirements: Define your computational needs and constraints
  2. Choose Language: Select appropriate scripting language and frameworks
  3. Design Script: Plan script structure and execution flow
  4. Implement Logic: Write and test core computational components
  5. Test Thoroughly: Validate script functionality and performance
  6. Deploy and Monitor: Submit jobs and track execution

Next Steps