Job Submissions Introduction
Job submissions in Vantage represent the process of requesting computational resources and executing workloads on cluster infrastructure. Understanding job submission workflows is essential for effectively utilizing cluster resources and managing computational tasks.
What are Job Submissions?
Job submissions are requests to the cluster scheduler that include the following elements; a minimal script covering them is sketched after the list:
- Resource Requirements: CPU, memory, GPU, and storage specifications
- Execution Instructions: Scripts, commands, or applications to run
- Scheduling Parameters: Priority, queue selection, and timing constraints
- Environment Configuration: Software modules, environment variables, and dependencies
- Input/Output Specifications: Data sources, output destinations, and file handling
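As a sketch, a minimal SLURM batch script touches each of these elements; the partition, module, script, and data names below are placeholders:
#!/bin/bash
# Resource requirements
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
# Scheduling parameters
#SBATCH --partition=standard
#SBATCH --time=01:00:00
# Input/output specification
#SBATCH --output=logs/%x_%j.out

# Environment configuration (module name is a placeholder)
module load python/3.9
export OMP_NUM_THREADS=4

# Execution instructions (script and data names are placeholders)
python process.py --input data.csv --output results/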
Types of Job Submissions
Batch Jobs
Traditional batch processing for non-interactive workloads:
- Computational Jobs: CPU-intensive calculations and simulations
- Data Processing: Large-scale data analysis and transformation
- Rendering Jobs: Graphics and visualization processing
- Machine Learning: Model training and hyperparameter optimization
Interactive Jobs
Real-time interactive computing sessions:
- Development Work: Code development and debugging
- Data Exploration: Interactive analysis and visualization
- Testing: Algorithm testing and validation
- Prototyping: Rapid development and experimentation
Array Jobs
Parallel execution of similar tasks with different parameters (a short array example follows this list):
- Parameter Sweeps: Running simulations with different input parameters
- Data Splits: Processing datasets divided into chunks
- Monte Carlo Simulations: Running multiple random simulations
- Batch Processing: Processing multiple files or datasets
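As a sketch, a SLURM array job expresses a parameter sweep as a single submission, with each task picking up its own index; the script name and index range are placeholders:
#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --array=1-50
#SBATCH --cpus-per-task=4
#SBATCH --time=02:00:00
#SBATCH --output=logs/sweep_%A_%a.out

# Each array task receives its own index via SLURM_ARRAY_TASK_ID
python simulate.py --parameter-set "${SLURM_ARRAY_TASK_ID}"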
Workflow Jobs
Complex multi-step computational workflows (a dependency example follows this list):
- Pipeline Processing: Sequential data processing stages
- Dependency Management: Jobs with dependencies on other jobs
- Conditional Execution: Dynamic workflow paths based on results
- Resource Optimization: Dynamic resource allocation across workflow steps
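On a SLURM cluster, dependencies and simple conditional paths can be expressed at submission time; the script names here are placeholders:
# Stage 2 runs only if stage 1 completes successfully
stage1=$(sbatch --parsable stage1_preprocess.sh)
stage2=$(sbatch --parsable --dependency=afterok:${stage1} stage2_analyze.sh)

# Conditional path: run a recovery step only if stage 1 fails
sbatch --dependency=afternotok:${stage1} stage1_recover.sh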
Job Submission Lifecycle
Submission Phase
- Job Preparation: Resource planning and script development
- Parameter Configuration: Setting resource requirements and options
- Validation: Checking job specifications and dependencies
- Queue Submission: Submitting job to scheduler queue
Scheduling Phase
- Queue Management: Job placement in appropriate queue
- Resource Allocation: Matching job requirements with available resources
- Priority Processing: Scheduling based on priority and policies
- Execution Start: Launching job on allocated resources
Execution Phase
- Environment Setup: Configuring software environment and dependencies
- Data Staging: Moving input data to compute nodes
- Job Execution: Running computational workload
- Output Collection: Gathering results and logs
Completion Phase
- Resource Cleanup: Releasing allocated resources
- Output Transfer: Moving results to permanent storage
- Status Reporting: Recording job completion status and metrics (see the accounting example below)
- Billing and Accounting: Recording resource usage for charging and reporting
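Once a job completes, the scheduler's accounting records reflect this lifecycle. On SLURM, for example, sacct reports the submission, start, and end timestamps along with the final state (the job ID below is a placeholder):
sacct -j 123456 --format=JobID,Submit,Start,End,Elapsed,State,ExitCode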
Submission Methods
Command Line Interface
Direct submission using cluster commands:
# SLURM example
sbatch --job-name=analysis \
       --cpus-per-task=16 \
       --mem=32G \
       --time=04:00:00 \
       --output=job_%j.out \
       analysis_script.sh
# PBS example
qsub -N analysis \
     -l select=1:ncpus=16:mem=32gb \
     -l walltime=04:00:00 \
     -o job_output.log \
     analysis_script.sh
Web Interface
Browser-based job submission through Vantage portal:
- Form-Based Submission: Fill out web forms with job parameters
- Template Selection: Choose from pre-configured job templates
- File Upload: Upload scripts and data through web interface
- Real-Time Monitoring: Track job progress and status
API Integration
Programmatic submission using REST APIs:
import requests
import json
# Vantage API submission example
job_spec = {
    "name": "data_analysis",
    "script": "analysis.py",
    "resources": {
        "cpus": 16,
        "memory": "32GB",
        "walltime": "04:00:00"
    },
    "environment": {
        "modules": ["python/3.9", "numpy/1.21"],
        "variables": {"OMP_NUM_THREADS": "16"}
    },
    "input_files": ["data.csv", "config.yaml"],
    "output_directory": "/results/analysis_run_001"
}

response = requests.post(
    "https://vantage.cluster.org/api/jobs",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json=job_spec
)
job_id = response.json()["job_id"]
print(f"Job submitted with ID: {job_id}")
Workflow Engines
Integration with workflow management systems:
// Nextflow example
process ANALYSIS {
    cpus 16
    memory '32 GB'
    time '4h'

    input:
    path input_data

    output:
    path "results/*"

    script:
    """
    python analysis.py --input ${input_data} --output results/
    """
}
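Assuming the process above lives in a pipeline file such as main.nf (a placeholder name) with a cluster executor configured in nextflow.config, the whole workflow can be launched from the command line and Nextflow submits each task as a scheduler job:
nextflow run main.nf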
Resource Management
Resource Types
Understanding different resource categories:
Compute Resources
- CPU Cores: Processing power for computational tasks
- Memory (RAM): Working memory for data processing
- Local Storage: Temporary disk space for job execution
- Network Bandwidth: Data transfer capabilities
Specialized Resources
- GPUs: Graphics processing units for parallel computing
- High Memory Nodes: Nodes with large memory capacity
- Fast Storage: High-performance storage systems
- Interconnect: High-speed networking for parallel jobs
Software Resources
- Licensed Software: Applications requiring license management
- Container Runtime: Docker/Singularity container support
- Parallel Libraries: MPI, OpenMP, and other parallel frameworks
- Specialized Tools: Domain-specific software and libraries
Resource Specification
Properly specifying resource requirements:
#!/bin/bash
#SBATCH --job-name=ml_training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --gres=gpu:v100:2
#SBATCH --time=12:00:00
#SBATCH --output=logs/training_%j.out
#SBATCH --error=logs/training_%j.err
# Load required modules
module load python/3.9
module load cuda/11.2
module load cudnn/8.2
# Set environment variables
export OMP_NUM_THREADS=16
export CUDA_VISIBLE_DEVICES=0,1
# Run training job
python train_model.py \
    --data /data/training_set.h5 \
    --output /models/experiment_001/ \
    --batch-size 128 \
    --epochs 100 \
    --gpus 2
Resource Optimization
Strategies for efficient resource utilization:
Right-Sizing Resources
- CPU Allocation: Match CPU count to application parallelization
- Memory Sizing: Allocate sufficient memory without waste
- Time Limits: Set realistic walltime limits
- GPU Usage: Request GPUs only when needed; a post-run check for right-sizing is sketched below
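A practical right-sizing check (assuming SLURM accounting is enabled) is to compare what a finished job requested with what it actually used, then adjust the next submission; the job ID is a placeholder:
# Requested vs. actually used resources for a finished job
sacct -j 123456 --format=JobID,ReqMem,MaxRSS,AllocCPUS,TotalCPU,Elapsed

# If the seff utility is installed, it summarizes CPU and memory efficiency
seff 123456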
Monitoring and Tuning
# Monitor resource usage during execution
monitor_job_resources() {
    local job_id="$1"
    while squeue -j "$job_id" &>/dev/null; do
        # CPU and memory usage
        sstat -j "$job_id" --format=JobID,AveCPU,AveRSS,MaxRSS,MaxRSSNode
        # GPU utilization (if applicable)
        if command -v nvidia-smi &>/dev/null; then
            nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
                       --format=csv,noheader,nounits
        fi
        sleep 60
    done
}
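The function could then be started in the background from a login shell while the job runs; the job ID is a placeholder:
monitor_job_resources 123456 > usage_123456.log &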
Job Queues and Scheduling
Queue Types
Understanding different queue categories (a quick way to inspect your cluster's queues follows these lists):
Standard Queues
- General Purpose: Default queue for most workloads
- Short Jobs: Quick jobs with limited runtime
- Long Jobs: Extended runtime for large computations
- Development: Interactive development and testing
Specialized Queues
- GPU Queue: Jobs requiring GPU resources
- High Memory: Jobs requiring large memory allocation
- Priority Queue: High-priority or urgent jobs
- Preemptible: Lower-cost queue with potential interruption
Organizational Queues
- Project Queues: Dedicated resources for specific projects
- Department Queues: Resources allocated to departments
- Research Queues: Academic research allocations
- Commercial Queues: Industry and commercial users
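On a SLURM cluster, the available queues (partitions) and their limits can be inspected before choosing where to submit; the partition name in the second command is site-specific:
# List partitions with time limit, node count, CPUs, memory, and availability
sinfo --format="%P %l %D %c %m %a"

# Submit explicitly to a particular partition
sbatch --partition=gpu training_job.sh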
Scheduling Policies
How jobs are prioritized and scheduled:
Fair Share Scheduling
- User Quotas: Allocation based on user or project quotas
- Historical Usage: Priority adjustment based on past usage
- Resource Sharing: Fair distribution among users and projects
- Backfill Scheduling: Filling scheduling gaps with smaller jobs to keep resources busy (commands for checking your standing follow below)
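On SLURM clusters, the fair-share standing and the priority factors behind these policies can be inspected directly:
# Fair-share usage and shares for your user and accounts
sshare -u $USER

# Priority factors (age, fair-share, QOS, ...) for your pending jobs
sprio -u $USER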
Priority-Based Scheduling
# Submit a job under a high-priority QOS (QOS names are site-defined)
sbatch --qos=urgent \
       critical_analysis.sh

# Submit a low-priority job that can be preempted
sbatch --nice=1000 \
       --qos=preemptible \
       background_processing.sh
Best Practices
Job Design
Designing efficient and reliable jobs (a minimal script skeleton follows these lists):
Modular Design
- Script Organization: Break complex workflows into modular components
- Error Handling: Implement robust error checking and recovery
- Logging: Comprehensive logging for debugging and monitoring
- Checkpointing: Save intermediate results for long-running jobs
Resource Efficiency
- Parallel Processing: Utilize available CPU cores effectively
- Memory Management: Optimize memory usage patterns
- I/O Optimization: Minimize data movement and file operations
- Cleanup: Proper cleanup of temporary files and resources
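A minimal sketch of these practices in a batch script: strict error handling, simple logging, and cleanup of temporary files even when the job fails; the stage scripts and paths are placeholders:
#!/bin/bash
#SBATCH --job-name=robust_job
#SBATCH --time=02:00:00
#SBATCH --output=logs/%x_%j.out

set -euo pipefail                 # stop on errors, unset variables, pipe failures

workdir=$(mktemp -d)              # job-local scratch directory
trap 'rm -rf "$workdir"' EXIT     # cleanup runs on success and on failure

log() { echo "[$(date '+%F %T')] $*"; }

log "Stage 1: preprocessing"
python stage1.py --workdir "$workdir"

log "Stage 2: analysis"
python stage2.py --workdir "$workdir" --output results/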
Submission Strategy
Optimizing job submission patterns:
Batch Optimization
# Submit multiple related jobs efficiently
for param in $(seq 1 100); do
    sbatch --job-name="analysis_${param}" \
           --export=ALL,PARAM=${param} \
           parameter_sweep.sh
done

# Or use job arrays for better efficiency
sbatch --array=1-100 \
       --job-name="parameter_sweep" \
       array_analysis.sh
Dependency Management
# Submit jobs with dependencies
job1=$(sbatch --parsable job1.sh)
job2=$(sbatch --dependency=afterok:$job1 --parsable job2.sh)
job3=$(sbatch --dependency=afterok:$job2 --parsable job3.sh)
echo "Submitted workflow: $job1 -> $job2 -> $job3"
Monitoring and Troubleshooting
Tracking job progress and resolving issues:
Job Status Monitoring
# Check job status
squeue -u $USER
# Detailed job information
scontrol show job $JOB_ID
# Job accounting information
sacct -j $JOB_ID --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
Common Issues and Solutions
- Resource Limitations: Adjust resource requests based on actual usage (see the memory check below)
- Software Dependencies: Ensure all required modules and libraries are available
- Data Access: Verify input data is accessible from compute nodes
- Output Permissions: Check write permissions for output directories
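For example, on SLURM a job killed for exceeding its memory request is flagged in accounting, and comparing MaxRSS against ReqMem shows how far off the request was; the job ID is a placeholder:
sacct -j 123456 --format=JobID,State,ReqMem,MaxRSS,Elapsed,Timelimit
# A State of OUT_OF_MEMORY or MaxRSS close to ReqMem suggests raising --mem;
# a large gap suggests the request can be reduced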
Getting Started
To begin submitting jobs effectively:
- Understand Cluster Policies: Review resource limits and usage policies
- Choose Appropriate Queues: Select queues matching your workload requirements
- Start Small: Begin with simple jobs before scaling to complex workflows
- Monitor Usage: Track resource utilization and optimize accordingly
- Learn from Examples: Study successful job submissions from others
Next Steps
- Create Job Submissions - Learn to submit jobs effectively
- Share Job Submissions - Collaborate on job workflows
- Delete Job Submissions - Manage job lifecycle and cleanup