Job Scripts Introduction

Job scripts in Vantage are executable programs that define the computational work to be performed on cluster resources. They serve as the core component of job submission, containing the instructions, commands, and logic needed to complete specific tasks.

What are Job Scripts?

Job scripts are files that contain:

  • Execution Commands: The actual programs and commands to run
  • Resource Directives: Instructions for the job scheduler about resource requirements
  • Environment Setup: Configuration of software environments and dependencies
  • Data Handling: Input/output operations and file management
  • Error Handling: Logic for managing failures and recovery

Types of Job Scripts

Batch Scripts

Traditional batch processing scripts for non-interactive workloads (submission example below):

  • Sequential Processing: Jobs that run from start to finish without user interaction
  • Long-Running Tasks: Computations that may take hours or days to complete
  • Resource-Intensive: Jobs requiring significant CPU, memory, or storage resources
  • Scheduled Execution: Jobs that run at specific times or intervals
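
On a Slurm-based cluster, a batch script is submitted to the scheduler rather than run directly. A minimal sketch, assuming a script named my_analysis.sh:

# Submit the batch script; Slurm prints the assigned job ID
sbatch my_analysis.sh

# Defer execution to a specific start time for scheduled runs
sbatch --begin=02:00 my_analysis.sh

# Check the status of your queued and running jobs
squeue -u $USER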

Interactive Scripts

Scripts designed for interactive computing sessions (example below):

  • Real-Time Interaction: Jobs that require user input during execution
  • Development Workflows: Testing and debugging computational code
  • Data Exploration: Interactive analysis and visualization tasks
  • Rapid Prototyping: Quick iterations and experimentation
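
On Slurm-based clusters, interactive sessions are typically started with srun or salloc rather than submitted as batch scripts; the resource values below are illustrative:

# Start an interactive shell on a compute node
srun --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash

# Or create an allocation first, then run commands inside it
salloc --cpus-per-task=4 --mem=8G --time=01:00:00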

Workflow Scripts

Complex scripts that orchestrate multiple computational steps (dependency example below):

  • Multi-Stage Processing: Jobs with sequential or parallel processing stages
  • Conditional Logic: Dynamic execution paths based on intermediate results
  • Error Recovery: Automatic retry and failure handling mechanisms
  • Resource Optimization: Dynamic resource allocation based on workload
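
One common way to express such workflows on Slurm is to submit each stage as its own job and chain them with dependencies. A minimal sketch, assuming stage scripts named preprocess.sh, train.sh, and report.sh:

#!/bin/bash
# --parsable makes sbatch print just the job ID so it can be captured
prep_id=$(sbatch --parsable preprocess.sh)

# Each later stage starts only if the previous stage completed successfully
train_id=$(sbatch --parsable --dependency=afterok:${prep_id} train.sh)
sbatch --dependency=afterok:${train_id} report.sh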

Script Languages and Frameworks

Shell Scripts

Traditional bash and shell scripting:

#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00

module load python/3.9
source activate my_env

python analysis.py --input data.csv --output results.txt

Python Scripts

Python-based computational scripts:

#!/usr/bin/env python3
"""
Data processing job script
"""
import sys
import pandas as pd
from pathlib import Path

def main():
    input_file = sys.argv[1]
    output_file = sys.argv[2]

    # Process data
    df = pd.read_csv(input_file)
    results = df.groupby('category').sum()
    results.to_csv(output_file)

if __name__ == "__main__":
    main()

R Scripts

Statistical computing and analysis:

#!/usr/bin/env Rscript
# Statistical analysis job script

# Load required libraries
library(dplyr)
library(ggplot2)

# Read command line arguments
args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_dir <- args[2]

# Perform analysis
data <- read.csv(input_file)
summary_stats <- data %>%
  group_by(group) %>%
  summarise(mean_value = mean(value),
            sd_value = sd(value))

# Save results
write.csv(summary_stats, file.path(output_dir, "summary.csv"))

Container-Based Scripts

Scripts using containerized environments:

#!/bin/bash
#SBATCH --job-name=container_job
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# Run with Singularity container
singularity exec tensorflow.sif python train_model.py

# Run with Docker (if available)
docker run --rm -v $(pwd):/workspace tensorflow/tensorflow:latest \
python /workspace/train_model.py

Script Structure and Components

Header Section

Resource requirements and job metadata:

#!/bin/bash
#SBATCH --job-name=data_processing
#SBATCH --account=research_project
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=job_%j.out
#SBATCH --error=job_%j.err

Environment Setup

Software environment configuration:

# Load required modules
module purge
module load gcc/9.3.0
module load python/3.9.5
module load cuda/11.2

# Activate virtual environment
source /path/to/venv/bin/activate

# Set environment variables
export OMP_NUM_THREADS=16
export CUDA_VISIBLE_DEVICES=0,1

Main Execution

Core computational logic:

# Create working directory
WORK_DIR="/tmp/job_${SLURM_JOB_ID}"
mkdir -p $WORK_DIR
cd $WORK_DIR

# Copy input files
cp $SLURM_SUBMIT_DIR/input/* .

# Run main computation
python analysis.py --config config.yaml --output results/

# Copy results back
cp -r results/ $SLURM_SUBMIT_DIR/

Error Handling

Robust error management:

# Cleanup function (defined before the traps that reference it)
cleanup_temp_files() {
    if [ -d "$WORK_DIR" ]; then
        rm -rf "$WORK_DIR"
    fi
}

# Function to handle errors
handle_error() {
    echo "Error occurred in script at line $1" >&2
    cleanup_temp_files
    exit 1
}

# Set error trap
trap 'handle_error $LINENO' ERR

# Ensure cleanup on exit
trap cleanup_temp_files EXIT

Best Practices

Resource Management

Optimize resource usage (monitoring example below):

  • Right-sizing: Request appropriate resources for your workload
  • Monitoring: Track actual resource usage vs. requested resources
  • Efficiency: Minimize idle time and resource waste
  • Scaling: Design scripts to scale with available resources
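
For the monitoring point, requested and actual usage can be compared after a job finishes with Slurm's accounting tools; replace <jobid> with a real job ID:

# Requested vs. consumed resources for a completed job
sacct -j <jobid> --format=JobID,JobName,ReqMem,MaxRSS,ReqCPUS,TotalCPU,Elapsed,State

# If the seff utility is installed, it summarizes CPU and memory efficiency
seff <jobid>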

Robust Error Handling

  • Input Validation: Check input files and parameters before processing (see the sketch below)
  • Graceful Failure: Handle errors without corrupting data or resources
  • Logging: Capture detailed information about script execution
  • Recovery: Implement mechanisms to resume from failures
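
A minimal sketch of input validation and logging at the top of a bash job script; the file and variable names are illustrative:

#!/bin/bash
set -euo pipefail   # abort on errors, unset variables, and failed pipes

INPUT_FILE="${1:-}"

# Validate inputs before doing any work
if [ -z "$INPUT_FILE" ] || [ ! -r "$INPUT_FILE" ]; then
    echo "ERROR: input file '$INPUT_FILE' is missing or unreadable" >&2
    exit 1
fi

# Log basic execution context for later debugging
echo "Job ${SLURM_JOB_ID:-local} started on $(hostname) at $(date)"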

Performance Optimization

Maximize computational efficiency (thread-sizing example below):

  • Parallelization: Use available CPU cores and GPUs effectively
  • I/O Optimization: Minimize file system operations and data movement
  • Memory Management: Optimize memory usage patterns and avoid leaks
  • Profiling: Identify and optimize performance bottlenecks
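
For the parallelization point, a common pattern is to derive thread counts from the scheduler's allocation instead of hard-coding them. A sketch assuming an OpenMP-capable program named simulate:

#!/bin/bash
#SBATCH --cpus-per-task=16

# Match the thread count to the allocated CPUs rather than hard-coding it
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# Launch through srun so the process is bound to the allocated cores
srun ./simulate --input params.in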

Reproducibility

Ensure consistent and reproducible results (example below):

  • Version Control: Track script versions and changes
  • Environment Specification: Document software dependencies and versions
  • Parameter Documentation: Clearly document input parameters and options
  • Output Validation: Implement checks to verify result correctness
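
A small sketch of recording the software environment alongside the results; the paths are illustrative and assume a Python virtual environment and a git checkout:

# Record the environment that produced the results
RUN_DIR="$SLURM_SUBMIT_DIR/results"
mkdir -p "$RUN_DIR"

module list 2> "$RUN_DIR/modules.txt"           # module systems print the list to stderr
pip freeze > "$RUN_DIR/requirements.txt"        # Python package versions
git rev-parse HEAD > "$RUN_DIR/git_commit.txt"  # exact script version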

Script Development Workflow

Development Environment

Set up an effective development environment:

  1. Local Testing: Develop and test scripts on local systems
  2. Version Control: Use git or similar for script management
  3. Documentation: Maintain clear documentation and comments
  4. Collaboration: Share scripts with team members for review

Testing Strategy

Implement comprehensive testing (smoke-test example below):

  1. Unit Testing: Test individual script components
  2. Integration Testing: Verify script works with cluster systems
  3. Performance Testing: Validate resource usage and efficiency
  4. Regression Testing: Ensure changes don't break existing functionality
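
One lightweight way to cover the integration and performance checks is a smoke-test submission with a small input and reduced limits before a full production run; the option values and file names below are illustrative:

# Smoke test: same script, small input, short time limit
sbatch --time=00:10:00 --mem=4G \
       --export=ALL,INPUT_FILE=data_sample.csv \
       my_analysis.sh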

Deployment Process

Deploy scripts to production:

  1. Staging: Test scripts in staging environment
  2. Validation: Verify script works with production data
  3. Monitoring: Track script performance and success rates
  4. Maintenance: Regular updates and optimization

Common Use Cases

Data Processing

Scripts for data transformation and analysis:

  • ETL Pipelines: Extract, transform, and load data workflows
  • Batch Processing: Large-scale data processing jobs
  • Stream Processing: Real-time data processing and analysis
  • Data Validation: Quality checks and data integrity verification

Machine Learning

Scripts for ML model development and training:

  • Model Training: Training neural networks and ML models
  • Hyperparameter Tuning: Automated parameter optimization
  • Model Evaluation: Performance testing and validation
  • Inference: Large-scale model prediction and scoring

Scientific Computing

Scripts for research and simulation:

  • Numerical Simulations: Physics, chemistry, and engineering simulations
  • Statistical Analysis: Statistical modeling and hypothesis testing
  • Optimization: Mathematical optimization and search algorithms
  • Visualization: Data visualization and result presentation

Getting Started

To begin working with job scripts:

  1. Understand Requirements: Define your computational needs and constraints
  2. Choose Language: Select appropriate scripting language and frameworks
  3. Design Script: Plan script structure and execution flow
  4. Implement Logic: Write and test core computational components
  5. Test Thoroughly: Validate script functionality and performance
  6. Deploy and Monitor: Submit jobs and track execution

Next Steps