Submit Job Script
Submitting a job script to a Vantage cluster hands your work to the scheduler for execution. This guide covers the submission process, optimization strategies, and monitoring techniques for running jobs on high-performance computing resources.
Overview
Submitting a job script moves your work from development to execution. The process involves:
- Resource Allocation: Requesting and securing computational resources from the cluster scheduler
- Queue Management: Understanding job prioritization and scheduling policies
- Execution Monitoring: Tracking job progress and performance during runtime
- Error Handling: Managing failures and implementing recovery strategies
- Result Collection: Retrieving outputs and analyzing computational results
- Performance Optimization: Tuning resource usage and execution efficiency
Submission Process
Basic Submission
Submit a job script for execution:
# Submit script to default queue
vantage job submit my-script.py
# Submit with custom job name
vantage job submit my-script.py --name "Data Analysis Job"
# Submit to specific queue
vantage job submit my-script.py --queue gpu-queue
Resource Specification
Define computational resource requirements:
# Specify CPU and memory requirements
vantage job submit analysis.py \
--cpus 8 \
--memory 32GB \
--time 4:00:00
# Request GPU resources
vantage job submit ml-training.py \
--gpus 2 \
--gpu-type V100 \
--memory 64GB
# Specify storage requirements
vantage job submit data-processing.py \
--cpus 16 \
--memory 128GB \
--tmp-storage 1TB
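When submissions are generated from another program, it can help to build the command from a dictionary of resources instead of concatenating strings. The sketch below is only an illustration: build_submit_command and RESOURCE_FLAGS are hypothetical names, and the flag list simply mirrors the options shown in the examples above without being checked against the real CLI.

```python
import shlex

# Flag names copied from the examples in this guide; treat them as an
# assumption rather than a verified list of CLI options.
RESOURCE_FLAGS = ("cpus", "memory", "time", "gpus", "gpu-type", "tmp-storage")


def build_submit_command(script, **resources):
    """Return the argument list for a `vantage job submit` call."""
    argv = ["vantage", "job", "submit", script]
    for key, value in resources.items():
        flag = key.replace("_", "-")  # gpu_type -> gpu-type
        if flag not in RESOURCE_FLAGS:
            raise ValueError(f"unknown resource option: {flag}")
        argv += [f"--{flag}", str(value)]
    return argv


cmd = build_submit_command("analysis.py", cpus=8, memory="32GB", time="4:00:00")
# shlex.join quotes each argument so the line is safe to paste into a shell.
print(shlex.join(cmd))
```

Passing the list form to subprocess.run avoids shell quoting problems entirely; the joined string is mainly useful for logging what was submitted.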
Environment Configuration
Configure software environments and dependencies:
# Use specific software environment
vantage job submit script.py --env pytorch-env
# Load required modules
vantage job submit script.py --modules python/3.9,cuda/11.2
# Use container image
vantage job submit script.py --container tensorflow:latest
Advanced Submission Options
Array Jobs
Submit multiple related jobs as an array:
# Submit parameter sweep as job array
vantage job submit-array sweep-script.py \
--array 1-100 \
--param-file parameters.txt
# Submit with different input files (single quotes keep the shell from
# expanding ${ARRAY_ID} at submission time)
vantage job submit-array process-data.py \
--array 1-50 \
--input-pattern 'data_${ARRAY_ID}.csv'
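Inside the script, each array task has to work out which parameters or input file belong to it. The following is a minimal sketch of that lookup. It assumes the scheduler exports the task index as SLURM_ARRAY_TASK_ID (the Slurm convention; your cluster may use a different variable), and the function names and sample parameter lines are hypothetical.

```python
import os


def current_task_id(env=os.environ, variable="SLURM_ARRAY_TASK_ID"):
    """Read the 1-based array index; the variable name is scheduler-specific."""
    return int(env[variable])


def select_parameters(lines, task_id):
    """Return the whitespace-separated fields of the line for this task."""
    if not 1 <= task_id <= len(lines):
        raise IndexError(f"task {task_id} is outside 1-{len(lines)}")
    return lines[task_id - 1].split()


def input_file_for(task_id, pattern="data_{}.csv"):
    """Build the per-task input file name, e.g. data_7.csv for task 7."""
    return pattern.format(task_id)


# Stand-in for the contents of parameters.txt, one parameter set per line.
example_lines = ["0.01 32", "0.001 64", "0.0001 128"]
task_id = current_task_id({"SLURM_ARRAY_TASK_ID": "2"})
learning_rate, batch_size = select_parameters(example_lines, task_id)
```

Failing loudly when the index is out of range is deliberate: an array that is longer than the parameter file otherwise produces tasks that silently do nothing.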
Dependency Management
Manage job dependencies and workflows:
# Submit job with dependency
vantage job submit analysis.py --depend-on job-12345
# Chain multiple jobs
vantage job submit preprocess.py --name prep
vantage job submit analysis.py --depend-on-name prep --name analysis
vantage job submit visualize.py --depend-on-name analysis --name viz
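For workflows larger than a three-step chain, it is easy to submit jobs in an order that references a dependency that does not exist yet. One way to avoid this, sketched here with Python's standard graphlib module (Python 3.9 or later), is to describe the workflow as a mapping and derive the submission order from it. The job names match the chain above; submission_order is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter

# Map each job name to the set of job names it depends on.
dependencies = {
    "prep": set(),
    "analysis": {"prep"},
    "viz": {"analysis"},
}


def submission_order(deps):
    """Return job names so that every job comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = submission_order(dependencies)

# A circular workflow can never be scheduled; graphlib reports it up front.
try:
    submission_order({"a": {"b"}, "b": {"a"}})
    has_cycle = False
except CycleError:
    has_cycle = True
```

Each name in order can then be submitted in turn, passing its dependencies with the dependency options shown above.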
Scheduling Options
Control job scheduling and timing:
# Schedule job for future execution
vantage job submit script.py --start-time "2024-01-15 09:00"
# Set job priority
vantage job submit urgent-analysis.py --priority high
# Specify maximum runtime
vantage job submit long-job.py --max-time 24:00:00
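Time limits in this guide are written as hours, minutes, and seconds, with hours allowed to exceed 23. When limits are computed rather than typed by hand, a pair of small conversion helpers avoids arithmetic mistakes. This is a sketch under the assumption that the H:MM:SS form used in the examples is the only format you need; both function names are hypothetical.

```python
def walltime_to_seconds(text):
    """Convert 'H:MM:SS' (hours may exceed 23) to a number of seconds."""
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected H:MM:SS, got {text!r}")
    hours, minutes, seconds = (int(part) for part in parts)
    if hours < 0 or not 0 <= minutes < 60 or not 0 <= seconds < 60:
        raise ValueError(f"field out of range in {text!r}")
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_walltime(total):
    """Format a non-negative number of seconds as 'H:MM:SS'."""
    hours, remainder = divmod(int(total), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Extending a four-hour limit by two hours.
extended = seconds_to_walltime(
    walltime_to_seconds("4:00:00") + walltime_to_seconds("2:00:00")
)
```

Here extended is "6:00:00", and walltime_to_seconds("24:00:00") gives 86400.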
Monitoring and Management
Job Status Tracking
Monitor job execution status:
# Check job status
vantage job status job-12345
# List all jobs
vantage job list --user $USER
# Monitor job queue
vantage job queue --status running
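Scripts that act on a job's result usually need to wait until the job reaches a final state. The sketch below polls at a fixed interval with an overall timeout. It makes two assumptions: get_status is a function you supply (for example, one that runs the status command and parses its output), and the terminal state names are placeholders, since this guide does not list the states the cluster reports.

```python
import time

# Assumed state names; check what your cluster actually reports.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(get_status, job_id, interval=30.0, timeout=3600.0,
                 sleep=time.sleep):
    """Poll get_status(job_id) until it returns a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_status(job_id)
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{job_id} still {state!r} after {timeout} s")
        sleep(interval)


# Demonstration with a canned sequence of states instead of a real cluster.
canned = iter(["pending", "running", "running", "completed"])
naps = []
final_state = wait_for_job(lambda job_id: next(canned), "job-12345",
                           interval=5, sleep=naps.append)
```

Keep the interval generous on shared systems: frequent status queries from many users add load to the scheduler.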
Real-time Monitoring
Track job progress during execution:
# Follow job output
vantage job logs job-12345 --follow
# Monitor resource usage
vantage job monitor job-12345 --metrics cpu,memory,gpu
# Check job performance
vantage job stats job-12345
Job Control Operations
Manage running jobs:
# Cancel job
vantage job cancel job-12345
# Suspend job
vantage job suspend job-12345
# Resume suspended job
vantage job resume job-12345
# Modify running job
vantage job modify job-12345 --time +2:00:00
Best Practices
Pre-Submission Checklist
- Test Locally: Verify script functionality with small datasets
- Resource Estimation: Calculate appropriate CPU, memory, and time requirements
- Data Preparation: Ensure all input data is accessible and properly formatted
- Environment Verification: Confirm all dependencies are available
- Output Planning: Define output locations and expected file sizes
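Parts of this checklist can be automated so that obvious problems surface before the job waits in the queue. The following sketch checks input files, the output directory, and importable Python modules. The preflight function and its arguments are hypothetical, and the module check only handles top-level module names.

```python
import importlib.util
import os


def preflight(input_paths, output_dir, modules):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for path in input_paths:
        if not os.path.isfile(path):
            problems.append(f"missing input: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"unreadable input: {path}")
    if not os.path.isdir(output_dir):
        problems.append(f"missing output directory: {output_dir}")
    elif not os.access(output_dir, os.W_OK):
        problems.append(f"output directory not writable: {output_dir}")
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name}")
    return problems
```

Run the checks in the same software environment the job will use; a module that imports on your workstation is not necessarily installed on the compute nodes.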
Resource Optimization
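#!/usr/bin/env python3
# Example: size the worker pool and memory budget from the scheduler allocation
import os
import multiprocessing
# Get allocated resources from scheduler (Slurm sets these inside a job)
cpu_count = int(os.environ.get('SLURM_CPUS_PER_TASK', 1))
# SLURM_MEM_PER_NODE is reported in megabytes; fall back to 8 GB
memory_limit_mb = int(os.environ.get('SLURM_MEM_PER_NODE', 8192))
def process_chunk(chunk, max_memory_mb):
    # Placeholder: replace with the real per-chunk computation
    return sum(chunk)
if __name__ == '__main__':
    # Placeholder input: replace with your own iterable of chunks
    data_chunks = [range(0, 1000), range(1000, 2000)]
    # Split the memory budget evenly across workers
    per_worker_mb = memory_limit_mb // cpu_count
    # Configure parallel processing and process one chunk per task
    with multiprocessing.Pool(processes=cpu_count) as pool:
        results = pool.starmap(
            process_chunk, [(chunk, per_worker_mb) for chunk in data_chunks]
        )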
Error Handling
Implement robust error handling and recovery:
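#!/usr/bin/env python3
import sys
import logging
# Configure logging to both a file and the job's standard error stream
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('job.log'),
        logging.StreamHandler()
    ]
)
# Filled as each step completes, so a failure can still save what exists
partial_results = []
def run_analysis():
    # Placeholder: replace with the real computation
    partial_results.append('step-1')
    return len(partial_results)
def save_checkpoint(results):
    # Placeholder: persist intermediate results for a later restart
    with open('checkpoint.txt', 'w') as f:
        f.write('\n'.join(str(item) for item in results))
try:
    # Main computation
    result = run_analysis()
    logging.info(f"Analysis completed successfully: {result}")
except Exception as e:
    # logging.exception records the message and the full traceback
    logging.exception(f"Job failed with error: {e}")
    # Save partial results if available
    if partial_results:
        save_checkpoint(partial_results)
        logging.info("Partial results saved for recovery")
    # A non-zero exit status marks the job as failed
    sys.exit(1)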
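Saving partial results is only useful if a resubmitted job can pick them up again. The sketch below shows one possible shape for that: a JSON checkpoint written atomically, so a job killed mid-write never leaves a truncated file, and a loop that resumes from the last saved position. The file layout, the state keys, and the process_all function are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state as JSON to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default when no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def process_all(items, path):
    """Sum items, checkpointing after each one so a rerun skips done work."""
    state = load_checkpoint(path, {"next": 0, "total": 0})
    for index in range(state["next"], len(items)):
        state["total"] += items[index]
        state["next"] = index + 1
        save_checkpoint(state, path)
    return state["total"]
```

Checkpointing after every item keeps the example short; in a real job, checkpoint every few minutes of work so the I/O cost stays small relative to the computation.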
Performance Optimization
Resource Tuning
- CPU Scaling: Match CPU count to algorithm parallelization
- Memory Management: Optimize memory usage for large datasets
- I/O Optimization: Minimize disk I/O and network transfers
- GPU Utilization: Ensure efficient GPU memory and compute usage
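For CPU scaling, Amdahl's law gives a quick upper bound on what extra cores can deliver: if a fraction p of the runtime can be parallelized, the speedup on n CPUs is at most 1 / ((1 - p) + p / n). The sketch below evaluates that bound. The parallel fraction is something you have to measure or estimate for your own code, and the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, cpus):
    """Upper bound on speedup when only part of the work is parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


def parallel_efficiency(parallel_fraction, cpus):
    """Speedup per CPU; values far below 1 mean cores mostly sit idle."""
    return amdahl_speedup(parallel_fraction, cpus) / cpus


# A job that is 90% parallel, on 8 CPUs and on 64 CPUs.
speedup_8 = amdahl_speedup(0.9, 8)
speedup_64 = amdahl_speedup(0.9, 64)
```

With a 90% parallel job, 8 CPUs give a bound of roughly 4.7x while 64 CPUs give only about 8.8x, so the larger request costs eight times the cores for less than double the speed.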
Scheduling Optimization
- Queue Selection: Choose appropriate queues for job characteristics
- Time Estimation: Provide accurate runtime estimates for better scheduling
- Resource Requests: Request only necessary resources to improve queue times
- Job Sizing: Balance job size with scheduling efficiency
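One simple way to arrive at a runtime estimate is to time a small test run and extrapolate. The sketch below assumes runtime grows linearly with the number of items, which does not hold for every workload, then adds a safety margin and rounds up to a 15-minute boundary. The function name, the 25% margin, and the rounding step are illustrative choices, not scheduler requirements.

```python
import math


def estimate_walltime(sample_seconds, sample_items, total_items,
                      margin=1.25, round_to=900):
    """Extrapolate a timed sample to the full workload as 'H:MM:SS'."""
    if sample_items <= 0 or total_items <= 0:
        raise ValueError("item counts must be positive")
    per_item = sample_seconds / sample_items
    padded = per_item * total_items * margin
    # Round up so the request never falls below the padded estimate.
    rounded = math.ceil(padded / round_to) * round_to
    hours, remainder = divmod(rounded, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# 1,000 items took 120 seconds in a test; the full run has 100,000 items.
request = estimate_walltime(120, 1000, 100_000)
```

In this example the raw extrapolation is 12,000 seconds, the margin raises it to 15,000, and rounding up gives a request of 4:15:00.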
Troubleshooting Common Issues
Submission Failures
- Resource Availability: Verify requested resources are available
- Permission Issues: Check access to required files and directories
- Environment Problems: Ensure all software dependencies are accessible
- Script Errors: Validate script syntax and logic before submission
Runtime Issues
- Out of Memory: Increase the memory allocation or process data in smaller chunks
- Timeout Errors: Extend the job time limit or reduce the runtime
- Network Issues: Retry transient connectivity and data transfer failures
- Dependency Failures: Ensure all required software and data are available
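Transient network problems are often best handled inside the script by retrying the failing operation with a growing delay between attempts. The following is a minimal sketch of that pattern. The retry helper is hypothetical, and which exception types count as transient depends on the library you use for transfers.

```python
import time


def retry(operation, attempts=4, base_delay=1.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying listed errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the job fail visibly
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...


# Demonstration: an operation that fails twice before succeeding.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
outcome = retry(flaky_download, sleep=delays.append)
```

Only errors listed in retry_on are retried. Anything else, such as a missing file or a permission error, is raised immediately because repeating the call would not help.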
Post-Execution Issues
- Output Collection: Verify all expected outputs were generated
- Result Validation: Check computational results for correctness
- Resource Cleanup: Remove temporary files and release resources
- Performance Analysis: Review job performance for future optimization
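Output collection and cleanup can be combined by doing all work in a scratch directory and moving only the finished, non-empty results to their final location. The sketch below uses Python's tempfile module for this. The run_with_scratch function is hypothetical, and on a cluster you would typically point the scratch directory at node-local storage, which is not shown here.

```python
import os
import shutil
import tempfile


def run_with_scratch(work, final_dir):
    """Run work(scratch_dir) and keep only the output files it returns."""
    os.makedirs(final_dir, exist_ok=True)
    kept = []
    # The scratch directory and everything left in it is deleted on exit,
    # even if work() raises.
    with tempfile.TemporaryDirectory() as scratch:
        for path in work(scratch):
            if os.path.getsize(path) == 0:
                raise RuntimeError(f"empty output: {path}")
            kept.append(shutil.move(path, final_dir))
    return kept
```

The work callable receives the scratch path, writes whatever intermediate files it needs there, and returns the paths of the files worth keeping. An empty or missing output raises an error instead of being copied silently.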