Job Scripts Introduction
Job scripts in Vantage are executable programs that define the computational work to be performed on cluster resources. They serve as the core component of job submission, containing the instructions, commands, and logic needed to complete specific tasks.
What are Job Scripts?
Job scripts are files that contain:
- Execution Commands: The actual programs and commands to run
- Resource Directives: Instructions for the job scheduler about resource requirements
- Environment Setup: Configuration of software environments and dependencies
- Data Handling: Input/output operations and file management
- Error Handling: Logic for managing failures and recovery
Types of Job Scripts
Batch Scripts
Traditional batch processing scripts for non-interactive workloads:
- Sequential Processing: Jobs that run from start to finish without user interaction
- Long-Running Tasks: Computations that may take hours or days to complete
- Resource-Intensive: Jobs requiring significant CPU, memory, or storage resources
- Scheduled Execution: Jobs that run at specific times or intervals
Interactive Scripts
Scripts designed for interactive computing sessions:
- Real-Time Interaction: Jobs that require user input during execution
- Development Workflows: Testing and debugging computational code
- Data Exploration: Interactive analysis and visualization tasks
- Rapid Prototyping: Quick iterations and experimentation
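For example, on a Slurm-based cluster an interactive session is usually requested directly with srun rather than through a batch script; the partition name below is a placeholder for whatever your site provides:
# Request an interactive shell with 4 CPUs and 8 GB of memory for one hour
srun --partition=interactive --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash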
Workflow Scripts
Complex scripts that orchestrate multiple computational steps:
- Multi-Stage Processing: Jobs with sequential or parallel processing stages
- Conditional Logic: Dynamic execution paths based on intermediate results
- Error Recovery: Automatic retry and failure handling mechanisms
- Resource Optimization: Dynamic resource allocation based on workload
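A minimal sketch of such a workflow script is shown below; the stage commands (preprocess.py, train.py) and the retry count of three are illustrative placeholders, not a prescribed pattern:
#!/bin/bash
#SBATCH --job-name=workflow_example
# Stage 1: preprocessing; abort early if it fails (conditional logic)
if ! python preprocess.py --input raw.csv --output clean.csv; then
    echo "Preprocessing failed, skipping downstream stages" >&2
    exit 1
fi
# Stage 2: retry the training step up to three times (error recovery)
for attempt in 1 2 3; do
    python train.py --input clean.csv && break
    echo "Attempt ${attempt} failed, retrying..." >&2
done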
Script Languages and Frameworks
Shell Scripts
Traditional bash and shell scripting:
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
module load python/3.9
source activate my_env
python analysis.py --input data.csv --output results.txt
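Assuming this script is saved as analysis.sh, it would typically be submitted and monitored with the standard Slurm commands:
sbatch analysis.sh      # submit the job to the scheduler
squeue -u $USER         # list your queued and running jobs
sacct -j <jobid>        # review accounting data after completion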
Python Scripts
Python-based computational scripts:
#!/usr/bin/env python3
"""
Data processing job script
"""
import sys
import pandas as pd
def main():
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    # Process data
    df = pd.read_csv(input_file)
    results = df.groupby('category').sum()
    results.to_csv(output_file)

if __name__ == "__main__":
    main()
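On a Slurm cluster, a Python script like this is usually launched from a thin batch wrapper. The file name process_data.py and the resource values below are assumptions for illustration, not part of the script above:
#!/bin/bash
#SBATCH --job-name=pandas_job
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=01:00:00
module load python/3.9
python process_data.py data.csv results.csv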
R Scripts
Statistical computing and analysis:
#!/usr/bin/env Rscript
# Statistical analysis job script
# Load required libraries
library(dplyr)
library(ggplot2)
# Read command line arguments
args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_dir <- args[2]
# Perform analysis
data <- read.csv(input_file)
summary_stats <- data %>%
  group_by(group) %>%
  summarise(mean_value = mean(value),
            sd_value = sd(value))
# Save results
write.csv(summary_stats, file.path(output_dir, "summary.csv"))
Container-Based Scripts
Scripts using containerized environments:
#!/bin/bash
#SBATCH --job-name=container_job
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
# Run with Singularity container
singularity exec tensorflow.sif python train_model.py
# Run with Docker (if available)
docker run --rm -v "$(pwd)":/workspace tensorflow/tensorflow:latest \
    python /workspace/train_model.py
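If the container needs data that lives outside your home directory, host paths can usually be mounted into the container with --bind; the /scratch path and the --data-dir option below are placeholders for your site's storage layout and your own script's arguments:
singularity exec --bind /scratch/$USER:/data tensorflow.sif \
    python train_model.py --data-dir /data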
Script Structure and Components
Header Section
Resource requirements and job metadata:
#!/bin/bash
#SBATCH --job-name=data_processing
#SBATCH --account=research_project
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=job_%j.out
#SBATCH --error=job_%j.err
Environment Setup
Software environment configuration:
# Load required modules
module purge
module load gcc/9.3.0
module load python/3.9.5
module load cuda/11.2
# Activate virtual environment
source /path/to/venv/bin/activate
# Set environment variables
export OMP_NUM_THREADS=16
export CUDA_VISIBLE_DEVICES=0,1
Main Execution
Core computational logic:
# Create working directory
WORK_DIR="/tmp/job_${SLURM_JOB_ID}"
mkdir -p "$WORK_DIR"
cd "$WORK_DIR" || exit 1
# Copy input files
cp "$SLURM_SUBMIT_DIR"/input/* .
# Run main computation
python analysis.py --config config.yaml --output results/
# Copy results back
cp -r results/ "$SLURM_SUBMIT_DIR"/
Error Handling
Robust error management:
# Cleanup function
cleanup_temp_files() {
    if [ -d "$WORK_DIR" ]; then
        rm -rf "$WORK_DIR"
    fi
}
# Function to handle errors
handle_error() {
    echo "Error occurred in script at line $1"
    cleanup_temp_files
    exit 1
}
# Set error trap
trap 'handle_error $LINENO' ERR
# Ensure cleanup on exit
trap cleanup_temp_files EXIT
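Note that a plain ERR trap is not inherited by shell functions unless errtrace is enabled, so this pattern is commonly combined with Bash's strict-mode options:
# -E propagates the ERR trap into functions, -e exits on any error,
# -u treats unset variables as errors, -o pipefail catches failures in pipelines
set -Eeuo pipefail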
Best Practices
Resource Management
Optimize resource usage:
- Right-sizing: Request appropriate resources for your workload
- Monitoring: Track actual resource usage vs. requested resources (see the sacct sketch after this list)
- Efficiency: Minimize idle time and resource waste
- Scaling: Design scripts to scale with available resources
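Requested versus consumed resources can be compared after a job finishes using Slurm's accounting tools. The field selection below is one reasonable choice, and seff is a contributed utility that may not be installed at every site:
sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,MaxRSS,ReqMem,State
seff <jobid>    # per-job efficiency summary, if available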
Robust Error Handling
- Input Validation: Check input files and parameters before processing (a small guard example follows this list)
- Graceful Failure: Handle errors without corrupting data or resources
- Logging: Capture detailed information about script execution
- Recovery: Implement mechanisms to resume from failures
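A minimal input-validation guard in Bash, assuming the script expects an existing input file as its first argument, might look like this:
if [ $# -lt 1 ] || [ ! -f "$1" ]; then
    echo "Usage: $0 <input_file> (file must exist)" >&2
    exit 1
fi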
Performance Optimization
Maximize computational efficiency:
- Parallelization: Use available CPU cores and GPUs effectively (see the threading sketch after this list)
- I/O Optimization: Minimize file system operations and data movement
- Memory Management: Optimize memory usage patterns and avoid leaks
- Profiling: Identify and optimize performance bottlenecks
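One common way to keep thread counts aligned with the allocation is to derive them from Slurm's environment rather than hard-coding them:
# Match OpenMP threads to the CPUs actually granted to this task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}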
Reproducibility
Ensure consistent and reproducible results:
- Version Control: Track script versions and changes
- Environment Specification: Document software dependencies and versions (a capture example follows this list)
- Parameter Documentation: Clearly document input parameters and options
- Output Validation: Implement checks to verify result correctness
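A simple way to capture the environment alongside the results is to record the code revision and loaded software at the start of the job; the commands below assume a git checkout and a pip-based Python environment:
echo "Git commit: $(git rev-parse HEAD)"    # record the code version in the job log
module list 2>&1                            # loaded modules (written to stderr by default)
pip freeze > environment-snapshot.txt       # snapshot of installed Python packages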
Script Development Workflow
Development Environment
Set up an effective development environment:
- Local Testing: Develop and test scripts on local systems
- Version Control: Use git or similar for script management
- Documentation: Maintain clear documentation and comments
- Collaboration: Share scripts with team members for review
Testing Strategy
Implement comprehensive testing:
- Unit Testing: Test individual script components
- Integration Testing: Verify script works with cluster systems
- Performance Testing: Validate resource usage and efficiency
- Regression Testing: Ensure changes don't break existing functionality
Deployment Process
Deploy scripts to production:
- Staging: Test scripts in staging environment
- Validation: Verify script works with production data
- Monitoring: Track script performance and success rates
- Maintenance: Regular updates and optimization
Common Use Cases
Data Processing
Scripts for data transformation and analysis:
- ETL Pipelines: Extract, transform, and load data workflows
- Batch Processing: Large-scale data processing jobs
- Stream Processing: Real-time data processing and analysis
- Data Validation: Quality checks and data integrity verification
Machine Learning
Scripts for ML model development and training:
- Model Training: Training neural networks and ML models
- Hyperparameter Tuning: Automated parameter optimization
- Model Evaluation: Performance testing and validation
- Inference: Large-scale model prediction and scoring
Scientific Computing
Scripts for research and simulation:
- Numerical Simulations: Physics, chemistry, and engineering simulations
- Statistical Analysis: Statistical modeling and hypothesis testing
- Optimization: Mathematical optimization and search algorithms
- Visualization: Data visualization and result presentation
Getting Started
To begin working with job scripts:
- Understand Requirements: Define your computational needs and constraints
- Choose Language: Select appropriate scripting language and frameworks
- Design Script: Plan script structure and execution flow
- Implement Logic: Write and test core computational components
- Test Thoroughly: Validate script functionality and performance
- Deploy and Monitor: Submit jobs and track execution
Next Steps
- Create Job Scripts - Learn how to develop effective job scripts
- Share Job Scripts - Collaborate and share scripts with others
- Delete Job Scripts - Manage script lifecycle and cleanup