Create Job Submission

Creating effective job submissions is fundamental to successfully running computational workloads on Vantage clusters. This guide covers the complete process from planning to execution.

Planning Your Job Submission

Requirements Analysis

Before creating a submission, analyze your computational needs:

Workload Characteristics

  • Computation Type: CPU-intensive, memory-intensive, GPU-accelerated, or I/O-heavy
  • Parallelization: Single-threaded, multi-threaded, or distributed processing
  • Runtime Estimate: Expected execution time based on data size and complexity
  • Resource Scaling: How resource requirements scale with input size

Data Requirements

  • Input Data Size: Volume of data to be processed
  • Data Location: Where input data is stored and how it can be accessed
  • Output Size: Expected volume of results and intermediate files
  • Data Transfer: Network bandwidth requirements for data movement

Software Dependencies

  • Programming Environment: Python, R, MATLAB, or other languages
  • Libraries and Modules: Required software packages and versions
  • Licensing: Software licensing requirements and availability
  • Containers: Need for containerized environments

Job Submission Methods

Command Line Submission

Direct submission using cluster scheduler commands:

#!/bin/bash
# Basic SLURM job submission

sbatch --job-name=data_analysis \
       --account=research_project \
       --partition=compute \
       --nodes=1 \
       --cpus-per-task=16 \
       --mem=64G \
       --time=08:00:00 \
       --output=logs/analysis_%j.out \
       --error=logs/analysis_%j.err \
       analysis_script.sh
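
To act on a job after submission, capture its ID; with --parsable, sbatch prints only the job ID (the script name below is illustrative):

# Capture the job ID for later queries
JOB_ID=$(sbatch --parsable analysis_job.sh)

# Check queue position and full job details
squeue -j "$JOB_ID"
scontrol show job "$JOB_ID"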

Advanced Submission Scripts

Creating comprehensive submission scripts:

#!/bin/bash
#SBATCH --job-name=ml_training
#SBATCH --account=ml_research
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --gres=gpu:a100:4
#SBATCH --time=24:00:00
#SBATCH --output=logs/training_%j.out
#SBATCH --error=logs/training_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=researcher@university.edu

# Set strict error handling
set -euo pipefail

# Load required modules
module purge
module load python/3.9
module load cuda/11.8
module load cudnn/8.6

# Set environment variables
export OMP_NUM_THREADS=32
export CUDA_VISIBLE_DEVICES=0,1,2,3
# ${PYTHONPATH:-} avoids an unbound-variable error under set -u
export PYTHONPATH="${PYTHONPATH:-}:$(pwd)/src"

# Create output directories
mkdir -p logs models checkpoints

# Activate virtual environment
source venv/bin/activate

# Validate input data
if [[ ! -f "data/training_data.h5" ]]; then
echo "Error: Training data not found" >&2
exit 1
fi

# Start training with logging
echo "Starting training at $(date)"
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURMD_NODENAME"
echo "GPUs: $CUDA_VISIBLE_DEVICES"

python train_model.py \
    --data data/training_data.h5 \
    --output "models/experiment_$(date +%Y%m%d_%H%M%S)" \
    --batch-size 128 \
    --epochs 100 \
    --learning-rate 0.001 \
    --checkpoint-freq 10 \
    --gpus 4 \
    --verbose

echo "Training completed at $(date)"

Interactive Job Submission

Creating interactive computing sessions:

# Request interactive session
srun --job-name=interactive_analysis \
     --partition=interactive \
     --cpus-per-task=8 \
     --mem=32G \
     --time=04:00:00 \
     --pty bash

# Alternative: submit a batch job that starts a Jupyter server
sbatch --job-name=jupyter_session \
       --partition=interactive \
       --cpus-per-task=8 \
       --mem=32G \
       --time=08:00:00 \
       jupyter_server.sh
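
The jupyter_server.sh script referenced above is not shown in this guide; a minimal sketch might look like the following (module name, port, and bind address are assumptions to adapt to your cluster):

#!/bin/bash
# Minimal Jupyter server job script (illustrative)
module load python/3.9
source venv/bin/activate

# Bind to the compute node's hostname so an SSH tunnel can reach it
echo "Jupyter running on node: $(hostname)"
jupyter lab --no-browser --ip="$(hostname)" --port=8888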

Resource Specification

CPU and Memory Configuration

Optimize CPU and memory allocation:

# CPU-intensive job
#SBATCH --cpus-per-task=64
#SBATCH --mem=128G

# Memory-intensive job
#SBATCH --cpus-per-task=16
#SBATCH --mem=512G

# Balanced compute job
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
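
When memory needs scale with core count, --mem-per-cpu can be clearer than a flat --mem; the total allocation is cpus-per-task × mem-per-cpu:

# Equivalent to --mem=128G at 32 CPUs, but scales if the CPU count changes
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=4G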

GPU Resource Allocation

Specify GPU requirements effectively:

# Single GPU job
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G

# Multi-GPU training
#SBATCH --gres=gpu:v100:4
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G

# Specific GPU type
#SBATCH --gres=gpu:a100:2
#SBATCH --constraint=gpu_mem_80gb
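
Newer Slurm releases also accept the --gpus family of options as an alternative to --gres; availability depends on your cluster's Slurm version:

# Alternative GPU syntax (newer Slurm releases)
#SBATCH --gpus=a100:2
#SBATCH --cpus-per-gpu=8
#SBATCH --mem-per-gpu=64G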

Storage and I/O Configuration

Configure storage for optimal performance:

#!/bin/bash
#SBATCH --job-name=data_processing
# Request node-local scratch space
#SBATCH --tmp=500G

# Use local scratch for intensive I/O
SCRATCH_DIR="/tmp/job_${SLURM_JOB_ID}"
mkdir -p "$SCRATCH_DIR"

# Copy input data to scratch
echo "Copying data to local scratch..."
cp -r /shared/input_data/* "$SCRATCH_DIR/"

# Process data on local storage
cd "$SCRATCH_DIR"
python process_data.py --input . --output results/

# Copy results back to shared storage
echo "Copying results to shared storage..."
cp -r results/ "/shared/output/job_${SLURM_JOB_ID}/"

# Cleanup scratch space
rm -rf "$SCRATCH_DIR"

Job Arrays and Parameter Sweeps

Simple Job Arrays

Run multiple similar jobs efficiently:

#!/bin/bash
#SBATCH --job-name=parameter_sweep
# One task per parameter combination (3 learning rates × 3 batch sizes = 9)
#SBATCH --array=1-9
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=logs/sweep_%A_%a.out

# Get array task ID
TASK_ID=$SLURM_ARRAY_TASK_ID

# Define parameter values
LEARNING_RATES=(0.001 0.01 0.1)
BATCH_SIZES=(32 64 128)

# Calculate parameter combination
LR_INDEX=$(( (TASK_ID - 1) % ${#LEARNING_RATES[@]} ))
BS_INDEX=$(( (TASK_ID - 1) / ${#LEARNING_RATES[@]} % ${#BATCH_SIZES[@]} ))

LEARNING_RATE=${LEARNING_RATES[$LR_INDEX]}
BATCH_SIZE=${BATCH_SIZES[$BS_INDEX]}

echo "Task $TASK_ID: LR=$LEARNING_RATE, BS=$BATCH_SIZE"

# Run experiment
python train.py \
    --learning-rate "$LEARNING_RATE" \
    --batch-size "$BATCH_SIZE" \
    --output "results/task_${TASK_ID}/"

Advanced Parameter Sweeps

Complex parameter combinations with configuration files:

#!/usr/bin/env python3
"""
Generate parameter sweep configurations.
"""

import itertools
import json
from pathlib import Path


def generate_parameter_sweep():
    """Generate parameter combinations for the sweep."""

    # Define the parameter space
    parameters = {
        'learning_rate': [0.001, 0.01, 0.1],
        'batch_size': [32, 64, 128],
        'hidden_layers': [2, 3, 4],
        'dropout_rate': [0.1, 0.2, 0.3],
    }

    # Generate all combinations (3 × 3 × 3 × 3 = 81)
    param_names = list(parameters.keys())
    param_values = list(parameters.values())
    combinations = list(itertools.product(*param_values))

    # Write one JSON configuration file per combination
    config_dir = Path("configs")
    config_dir.mkdir(exist_ok=True)

    for i, combo in enumerate(combinations, 1):
        config = dict(zip(param_names, combo))
        config_file = config_dir / f"config_{i:03d}.json"

        with open(config_file, 'w') as f:
            json.dump(config, f, indent=2)

    print(f"Generated {len(combinations)} parameter combinations")
    return len(combinations)


if __name__ == "__main__":
    num_configs = generate_parameter_sweep()

The matching array script loads one configuration per task; the array size must equal the number of generated combinations (3 × 3 × 3 × 3 = 81):

#!/bin/bash
#SBATCH --job-name=config_sweep
# One array task per generated configuration (3^4 = 81)
#SBATCH --array=1-81
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

# Load configuration for this task
CONFIG_FILE="configs/config_$(printf "%03d" $SLURM_ARRAY_TASK_ID).json"

if [[ ! -f "$CONFIG_FILE" ]]; then
echo "Configuration file not found: $CONFIG_FILE"
exit 1
fi

echo "Using configuration: $CONFIG_FILE"

# Run experiment with configuration
python train_with_config.py \
--config "$CONFIG_FILE" \
--output "results/experiment_${SLURM_ARRAY_TASK_ID}/"

Workflow and Dependency Management

Sequential Job Dependencies

Chain jobs with dependencies:

#!/bin/bash
# Multi-stage workflow with dependencies

# Stage 1: Data preprocessing
JOB1=$(sbatch --parsable \
    --job-name=preprocess \
    --output=logs/preprocess_%j.out \
    preprocess_data.sh)

echo "Submitted preprocessing job: $JOB1"

# Stage 2: Model training (depends on preprocessing)
JOB2=$(sbatch --parsable \
    --dependency=afterok:$JOB1 \
    --job-name=training \
    --output=logs/training_%j.out \
    train_model.sh)

echo "Submitted training job: $JOB2"

# Stage 3: Model evaluation (depends on training)
JOB3=$(sbatch --parsable \
    --dependency=afterok:$JOB2 \
    --job-name=evaluation \
    --output=logs/evaluation_%j.out \
    evaluate_model.sh)

echo "Submitted evaluation job: $JOB3"

# Stage 4: Results analysis (depends on evaluation)
JOB4=$(sbatch --parsable \
    --dependency=afterok:$JOB3 \
    --job-name=analysis \
    --output=logs/analysis_%j.out \
    analyze_results.sh)

echo "Submitted analysis job: $JOB4"
echo "Workflow submitted: $JOB1 -> $JOB2 -> $JOB3 -> $JOB4"

Parallel Workflow Branches

Create parallel processing workflows:

#!/bin/bash
# Parallel workflow with convergence

# Data preparation
PREP_JOB=$(sbatch --parsable data_preparation.sh)

# Parallel processing branches
BRANCH1=$(sbatch --dependency=afterok:$PREP_JOB --parsable analysis_branch1.sh)
BRANCH2=$(sbatch --dependency=afterok:$PREP_JOB --parsable analysis_branch2.sh)
BRANCH3=$(sbatch --dependency=afterok:$PREP_JOB --parsable analysis_branch3.sh)

# Convergence job (wait for all branches)
MERGE_JOB=$(sbatch --dependency=afterok:$BRANCH1:$BRANCH2:$BRANCH3 \
    --parsable merge_results.sh)

echo "Parallel workflow:"
echo " Preparation: $PREP_JOB"
echo " Branch 1: $BRANCH1"
echo " Branch 2: $BRANCH2"
echo " Branch 3: $BRANCH3"
echo " Merge: $MERGE_JOB"

Environment and Software Management

Module Loading

Configure software environment:

#!/bin/bash
# Software environment setup

# Clear any existing modules
module purge

# Load core modules
module load gcc/11.2.0
module load python/3.9.7
module load openmpi/4.1.1

# Load domain-specific modules
module load fftw/3.3.10
module load hdf5/1.12.1
module load netcdf/4.8.1

# Load GPU modules (only when the job was allocated GPUs)
if [[ -n "${SLURM_GPUS_ON_NODE:-}" ]]; then
    module load cuda/11.8
    module load cudnn/8.6.0
    module load nccl/2.12.12
fi

# Display loaded modules
echo "Loaded modules:"
module list

# Set environment variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MPI_RANKS_PER_NODE=1
export PYTHONPATH="${PYTHONPATH:-}:$(pwd)/lib"

# Verify software availability
python --version
gcc --version

Container-Based Jobs

Use containers for consistent environments:

#!/bin/bash
#SBATCH --job-name=container_job
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00

# Singularity container execution
singularity exec \
    --bind /data:/data \
    --bind /results:/results \
    tensorflow_latest.sif \
    python /data/train_model.py --output /results/

# Docker execution (if available; many HPC clusters do not expose Docker directly)
docker run --rm \
    --gpus all \
    -v "$(pwd)/data:/workspace/data" \
    -v "$(pwd)/results:/workspace/results" \
    tensorflow/tensorflow:latest-gpu \
    python /workspace/data/train_model.py
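
For GPU workloads, Singularity needs the --nv flag to map the host's NVIDIA driver and devices into the container; a sketch of the GPU variant:

# GPU-enabled container execution
singularity exec --nv \
    --bind /data:/data \
    tensorflow_latest.sif \
    python /data/train_model.py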

Virtual Environment Management

Manage Python virtual environments:

#!/bin/bash
# Python virtual environment setup

# Create virtual environment if it doesn't exist
if [[ ! -d "venv" ]]; then
echo "Creating virtual environment..."
python -m venv venv

# Activate and install packages
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
else
# Activate existing environment
source venv/bin/activate
fi

# Verify environment
echo "Python path: $(which python)"
echo "Installed packages:"
pip list

# Run job with activated environment
python analysis_script.py

Monitoring and Debugging

Real-Time Monitoring

Monitor job progress during execution:

#!/bin/bash
# Job monitoring script

monitor_job() {
    local job_id="$1"

    echo "Monitoring job $job_id..."

    while squeue -j "$job_id" &>/dev/null; do
        # Job status
        echo "=== Job Status $(date) ==="
        squeue -j "$job_id" --format="%.18i %.9P %.20j %.8u %.8T %.10M %.6D %R"

        # Resource usage (the .batch step carries usage for the submission script)
        echo "=== Resource Usage ==="
        sstat -j "${job_id}.batch" --format=JobID,AveCPU,AveRSS,MaxRSS --parsable2 | tail -1

        # GPU usage (if applicable)
        if command -v nvidia-smi &>/dev/null; then
            echo "=== GPU Usage ==="
            nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
                --format=csv,noheader,nounits
        fi

        echo ""
        sleep 60
    done

    # Final job statistics
    echo "=== Final Job Statistics ==="
    sacct -j "$job_id" --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS,MaxVMSize
}

# Usage: monitor_job <job_id>

Debugging Failed Jobs

Diagnose and troubleshoot job failures:

#!/bin/bash
# Job debugging utility

debug_job() {
    local job_id="$1"

    echo "Debugging job $job_id"
    echo "==================="

    # Basic job information
    echo "Job Information:"
    scontrol show job "$job_id"
    echo ""

    # Job accounting
    echo "Job Accounting:"
    sacct -j "$job_id" --format=JobID,JobName,State,ExitCode,Start,End,Elapsed,MaxRSS,MaxVMSize
    echo ""

    # Check output files
    echo "Output Files:"
    find . -name "*${job_id}*" -type f | while read -r file; do
        echo "File: $file"
        echo "Size: $(stat -c%s "$file") bytes"
        echo "Last 10 lines:"
        tail -10 "$file"
        echo "---"
    done

    # Check for common issues
    echo "Common Issues Check:"

    # Memory issues (sacct prints one row per step; keep the last row)
    max_rss=$(sacct -j "$job_id" --format=MaxRSS --noheader --parsable2 | tail -1)
    if [[ "$max_rss" == *"G" ]]; then
        echo "⚠️ High memory usage detected: $max_rss"
    fi

    # Time limit issues
    state=$(sacct -j "$job_id" --format=State --noheader --parsable2 | head -1)
    if [[ "$state" == "TIMEOUT" ]]; then
        echo "⚠️ Job exceeded time limit"
    fi

    # Exit code check
    exit_code=$(sacct -j "$job_id" --format=ExitCode --noheader --parsable2 | head -1)
    if [[ "$exit_code" != "0:0" ]]; then
        echo "⚠️ Non-zero exit code: $exit_code"
    fi
}

# Usage: debug_job <job_id>

Advanced Submission Techniques

Dynamic Resource Allocation

Adjust resources based on input size:

#!/usr/bin/env python3
"""
Dynamic resource calculator.
"""

import os
import sys
from pathlib import Path


def calculate_resources(input_file):
    """Calculate resources based on input size."""

    file_size = Path(input_file).stat().st_size
    size_gb = file_size / (1024**3)

    # Base resource tiers by input size
    if size_gb < 1:
        cpus = 4
        memory = "16G"
        time = "02:00:00"
    elif size_gb < 10:
        cpus = 8
        memory = "32G"
        time = "04:00:00"
    elif size_gb < 100:
        cpus = 16
        memory = "64G"
        time = "08:00:00"
    else:
        cpus = 32
        memory = "128G"
        time = "24:00:00"

    return cpus, memory, time


def submit_dynamic_job(input_file, script):
    """Submit a job with dynamically calculated resources."""

    cpus, memory, time = calculate_resources(input_file)

    # Build the submission command (trailing backslashes are shell line continuations)
    cmd = f"""sbatch \\
        --job-name=dynamic_analysis \\
        --cpus-per-task={cpus} \\
        --mem={memory} \\
        --time={time} \\
        --export=INPUT_FILE={input_file} \\
        {script}"""

    print(f"Submitting job with resources: {cpus} CPUs, {memory} memory, {time} time")
    print(f"Command: {cmd}")

    # Execute submission through the shell
    os.system(cmd)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python dynamic_submit.py <input_file> <script>")
        sys.exit(1)

    submit_dynamic_job(sys.argv[1], sys.argv[2])
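
Example invocation (the input file and job script names are illustrative):

# Pick resources from the input size, then submit
python dynamic_submit.py data/measurements.h5 analysis_job.sh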

Adaptive Job Submission

Implement retry logic and adaptive scheduling:

#!/bin/bash
# Adaptive job submission with retry logic

submit_with_retry() {
    local script="$1"
    local max_attempts=3
    local current_attempt=1
    local base_memory=32
    local base_time="04:00:00"

    while [[ $current_attempt -le $max_attempts ]]; do
        echo "Submission attempt $current_attempt of $max_attempts"

        # Scale memory with each attempt
        memory=$((base_memory * current_attempt))

        # Adjust time limit for retry attempts
        case $current_attempt in
            1) time="$base_time" ;;
            2) time="08:00:00" ;;
            3) time="24:00:00" ;;
        esac

        echo "Using resources: ${memory}G memory, $time time"

        # Submit job
        job_id=$(sbatch --parsable \
            --job-name="adaptive_job_attempt_${current_attempt}" \
            --mem="${memory}G" \
            --time="$time" \
            "$script")

        if [[ $? -eq 0 ]]; then
            echo "Job submitted successfully: $job_id"

            # Wait for job completion
            while squeue -j "$job_id" &>/dev/null; do
                sleep 60
            done

            # Check job outcome (keep the first accounting row)
            exit_code=$(sacct -j "$job_id" --format=ExitCode --noheader --parsable2 | head -1)
            state=$(sacct -j "$job_id" --format=State --noheader --parsable2 | head -1)

            if [[ "$exit_code" == "0:0" && "$state" == "COMPLETED" ]]; then
                echo "Job completed successfully"
                return 0
            else
                echo "Job failed with exit code $exit_code, state $state"
                if [[ "$state" == "OUT_OF_MEMORY" || "$state" == "TIMEOUT" ]]; then
                    echo "Resource issue detected, will retry with increased resources"
                else
                    echo "Non-resource related failure, aborting retries"
                    return 1
                fi
            fi
        else
            echo "Job submission failed"
        fi

        ((current_attempt++))
    done

    echo "All retry attempts exhausted"
    return 1
}

# Usage
submit_with_retry "analysis_script.sh"

Best Practices Summary

Resource Management

  • Right-size resources based on actual requirements
  • Monitor resource usage and adjust accordingly
  • Use job arrays for parameter sweeps
  • Implement proper cleanup procedures

Reliability

  • Include comprehensive error handling
  • Implement checkpointing for long jobs (see the sketch after this list)
  • Use retry logic for transient failures
  • Validate inputs before processing
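
One common Slurm checkpointing pattern asks the scheduler to signal the job shortly before its time limit, saves state in the handler, and requeues; a minimal sketch, assuming the cluster permits requeueing and the task script (illustrative here) can resume from its last checkpoint:

#!/bin/bash
# Request USR1 300s before the time limit (the warning window is an assumption to tune)
#SBATCH --signal=B:USR1@300
#SBATCH --requeue

# On USR1: note the checkpoint, then requeue this job to continue later
trap 'echo "Time limit approaching: checkpointing"; scontrol requeue "$SLURM_JOB_ID"' USR1

# Run the workload in the background so the trap can fire during wait
./long_running_task.sh &
wait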

Efficiency

  • Use local scratch space for I/O intensive operations
  • Optimize parallel processing strategies
  • Minimize data movement and copying
  • Choose appropriate queues and partitions

Monitoring

  • Implement progress tracking and logging
  • Monitor resource utilization during execution
  • Set up notifications for job completion
  • Debug failures systematically

Next Steps

After mastering job submission, continue with the related guides in this documentation.