Share Job Submission
Sharing job submissions enables collaboration, knowledge transfer, and standardization of computational workflows across teams and projects. This guide covers strategies and best practices for sharing job submission workflows effectively.
Sharing Approaches
Template-Based Sharing
Create reusable job submission templates:
#!/bin/bash
# Generic analysis job template
# Usage: sbatch analysis_template.sh <input_data> <output_dir>
#SBATCH --job-name=analysis  # Shell variables such as $1 are NOT expanded in #SBATCH lines; set a per-run name with: sbatch --job-name=... analysis_template.sh ...
#SBATCH --account=shared_project
#SBATCH --partition=compute
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=08:00:00
#SBATCH --output=logs/analysis_%j.out
#SBATCH --error=logs/analysis_%j.err
# Input validation
if [[ $# -lt 2 ]]; then
echo "Usage: $0 <input_data> <output_dir>"
exit 1
fi
INPUT_DATA="$1"
OUTPUT_DIR="$2"
# Create the output directory (note: the logs/ directory used by --output/--error
# must already exist in the submission directory before sbatch runs, since Slurm
# does not create it)
mkdir -p "$OUTPUT_DIR/logs"
# Standard environment setup
module load python/3.9
source /shared/envs/analysis/bin/activate
# Run analysis
python /shared/scripts/generic_analysis.py \
--input "$INPUT_DATA" \
--output "$OUTPUT_DIR" \
--threads $SLURM_CPUS_PER_TASK
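A collaborator then submits the template with their own input file and output directory (the paths below are illustrative):

# Create the logs directory once per submission directory, then submit
# with a per-run job name
mkdir -p logs
sbatch --job-name=analysis_sample1 analysis_template.sh \
    /data/raw/sample1.fastq /results/sample1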
Configuration-Driven Submissions
Use configuration files for flexible sharing:
# shared_configs/ml_training.yaml
job_config:
name: "ml_training"
account: "ml_research"
partition: "gpu"
resources:
nodes: 1
cpus_per_task: 32
memory: "256G"
gpus: "a100:4"
time: "24:00:00"
environment:
modules:
- "python/3.9"
- "cuda/11.8"
- "cudnn/8.6"
conda_env: "/shared/envs/ml_training"
script:
path: "/shared/scripts/train_model.py"
parameters:
data_path: "/data/training_sets/"
output_path: "/results/ml_experiments/"
batch_size: 128
learning_rate: 0.001
epochs: 100
#!/usr/bin/env python3
"""
Configuration-based job submitter
"""
import argparse
import json
import re
import subprocess
from pathlib import Path

import yaml
class ConfigurableJobSubmitter:
def __init__(self, config_file):
with open(config_file, 'r') as f:
self.config = yaml.safe_load(f)
def generate_submission_script(self, output_file="job_submit.sh"):
"""Generate SLURM submission script from configuration"""
job_config = self.config['job_config']
env_config = self.config['environment']
script_config = self.config['script']
script_content = f"""#!/bin/bash
#SBATCH --job-name={job_config['name']}
#SBATCH --account={job_config['account']}
#SBATCH --partition={job_config['partition']}
#SBATCH --nodes={job_config['resources']['nodes']}
#SBATCH --cpus-per-task={job_config['resources']['cpus_per_task']}
#SBATCH --mem={job_config['resources']['memory']}
#SBATCH --gres=gpu:{job_config['resources']['gpus']}
#SBATCH --time={job_config['resources']['time']}
#SBATCH --output=logs/{job_config['name']}_%j.out
#SBATCH --error=logs/{job_config['name']}_%j.err
# Create logs directory
mkdir -p logs
# Load modules
"""
for module in env_config['modules']:
script_content += f"module load {module}\n"
script_content += f"""
# Activate conda environment
source activate {env_config['conda_env']}
# Set environment variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Note: Slurm sets CUDA_VISIBLE_DEVICES to the allocated GPUs; avoid overriding it
# Run script
python {script_config['path']} \\
"""
for param, value in script_config['parameters'].items():
script_content += f" --{param.replace('_', '-')} {value} \\\n"
script_content = script_content.rstrip(" \\\n")
with open(output_file, 'w') as f:
f.write(script_content)
Path(output_file).chmod(0o755)
return output_file
    def apply_custom_parameters(self, script_file, custom_params):
        """Rewrite flag values in the generated submission script"""
        with open(script_file, 'r') as f:
            content = f.read()
        for param, value in custom_params.items():
            flag = f"--{param.replace('_', '-')}"
            # Replace the value that currently follows this flag
            content = re.sub(rf"({re.escape(flag)}) \S+", rf"\g<1> {value}", content)
        with open(script_file, 'w') as f:
            f.write(content)

    def submit_job(self, custom_params=None):
        """Submit job with optional parameter overrides"""
        # Generate submission script
        script_file = self.generate_submission_script()
        # Apply custom parameters if provided
        if custom_params:
            self.apply_custom_parameters(script_file, custom_params)
# Submit job
result = subprocess.run(['sbatch', script_file],
capture_output=True, text=True)
if result.returncode == 0:
job_id = result.stdout.strip().split()[-1]
print(f"Job submitted successfully: {job_id}")
return job_id
else:
print(f"Job submission failed: {result.stderr}")
return None
# Usage example
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Submit job from configuration')
parser.add_argument('config', help='Configuration file path')
    parser.add_argument('--custom-params', help='Parameter overrides as JSON')
    args = parser.parse_args()
    custom_params = json.loads(args.custom_params) if args.custom_params else None
    submitter = ConfigurableJobSubmitter(args.config)
    job_id = submitter.submit_job(custom_params)
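Assuming the submitter is saved as scripts/submit_with_config.py (the path used in the repository layout below), a typical invocation looks like:

# Submit using the shared ML training configuration
python3 scripts/submit_with_config.py shared_configs/ml_training.yaml

# Override selected parameters without editing the shared config
python3 scripts/submit_with_config.py shared_configs/ml_training.yaml \
    --custom-params '{"batch_size": 256}'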
Collaborative Workflows
Shared Project Structure
Organize shared resources effectively:
/shared/projects/research_collaboration/
├── README.md # Project documentation
├── configs/ # Shared configuration files
│ ├── default_analysis.yaml
│ ├── gpu_training.yaml
│ └── large_memory.yaml
├── scripts/ # Shared job scripts
│ ├── submit_analysis.sh
│ ├── submit_training.sh
│ └── utilities/
│ ├── monitor_jobs.sh
│ └── cleanup_results.sh
├── templates/ # Job templates
│ ├── basic_job.sh
│ ├── array_job.sh
│ └── workflow_job.sh
├── data/ # Shared datasets
│ ├── input/
│ └── reference/
├── results/ # Shared results
│ ├── user1/
│ ├── user2/
│ └── collaborative/
└── docs/ # Documentation
├── usage_guide.md
├── best_practices.md
└── troubleshooting.md
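One way to bootstrap this layout is a short setup script; the Unix group research_collab below is an assumption, so substitute your own project group:

# Create the shared tree and make new files inherit the project group
PROJECT=/shared/projects/research_collaboration
mkdir -p "$PROJECT"/{configs,scripts/utilities,templates,data/{input,reference},results,docs}
chgrp -R research_collab "$PROJECT"   # hypothetical project group
chmod -R g+rwX "$PROJECT"
# setgid on directories so files created by collaborators keep the group
find "$PROJECT" -type d -exec chmod g+s {} +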
Version Control Integration
Manage shared submissions with Git:
# Initialize shared repository
git init shared-submissions
cd shared-submissions
# Create project structure
mkdir -p {configs,scripts,templates,docs}
# Add .gitignore for cluster-specific files
cat > .gitignore << 'EOF'
# Job output files
*.out
*.err
logs/
results/*/
# Temporary files
*.tmp
*.temp
.sbatch_*
# User-specific configs
user_configs/
local_settings.yaml
# Large data files
data/large_datasets/
*.h5
*.nc
EOF
# Add README with usage instructions
cat > README.md << 'EOF'
# Shared Job Submissions
This repository contains shared job submission scripts and configurations for our research collaboration.
## Quick Start
1. Clone this repository: `git clone /shared/repos/shared-submissions.git`
2. Choose appropriate configuration from `configs/`
3. Submit job using `scripts/submit_with_config.py`
## Directory Structure
- `configs/`: Configuration files for different job types
- `scripts/`: Reusable submission scripts
- `templates/`: Job template files
- `docs/`: Documentation and guides
## Contributing
1. Create feature branch: `git checkout -b feature/new-analysis`
2. Add your scripts and configurations
3. Test thoroughly before committing
4. Submit pull request for review
EOF
# Initial commit
git add .
git commit -m "Initial shared submission repository"
# Set up as bare repository for sharing
git clone --bare . /shared/repos/shared-submissions.git
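Team members then follow the contributing workflow from the README against the bare repository:

# Clone the shared repository and contribute a change
git clone /shared/repos/shared-submissions.git
cd shared-submissions
git checkout -b feature/new-analysis
# ... add or edit configs and scripts ...
git add configs/new_analysis.yaml
git commit -m "Add configuration for new analysis"
git push origin feature/new-analysis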
Collaborative Job Management
Implement shared job tracking and management:
#!/usr/bin/env python3
"""
Collaborative job management system
"""
import shutil
import sqlite3
from datetime import datetime
class CollaborativeJobManager:
def __init__(self, db_path="/shared/job_tracking.db"):
self.db_path = db_path
self.init_database()
def init_database(self):
"""Initialize job tracking database"""
conn = sqlite3.connect(self.db_path)
conn.execute('''
CREATE TABLE IF NOT EXISTS jobs (
job_id TEXT PRIMARY KEY,
user TEXT,
project TEXT,
job_name TEXT,
config_file TEXT,
submit_time DATETIME,
status TEXT,
description TEXT,
shared BOOLEAN DEFAULT 0
)
''')
conn.commit()
conn.close()
def register_job(self, job_id, user, project, job_name, config_file,
description="", shared=False):
"""Register a new job submission"""
conn = sqlite3.connect(self.db_path)
conn.execute('''
INSERT INTO jobs
(job_id, user, project, job_name, config_file, submit_time,
status, description, shared)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
''', (job_id, user, project, job_name, config_file,
datetime.now(), 'SUBMITTED', description, shared))
conn.commit()
conn.close()
print(f"Registered job {job_id} for project {project}")
def update_job_status(self, job_id, status):
"""Update job status"""
conn = sqlite3.connect(self.db_path)
conn.execute('''
UPDATE jobs SET status = ? WHERE job_id = ?
''', (status, job_id))
conn.commit()
conn.close()
def get_project_jobs(self, project, active_only=True):
"""Get all jobs for a project"""
conn = sqlite3.connect(self.db_path)
if active_only:
cursor = conn.execute('''
SELECT * FROM jobs
WHERE project = ? AND status IN ('SUBMITTED', 'RUNNING')
ORDER BY submit_time DESC
''', (project,))
else:
cursor = conn.execute('''
SELECT * FROM jobs
WHERE project = ?
ORDER BY submit_time DESC
''', (project,))
jobs = cursor.fetchall()
conn.close()
return jobs
def share_job_config(self, job_id, target_users=None):
"""Share job configuration with other users"""
conn = sqlite3.connect(self.db_path)
cursor = conn.execute('''
SELECT config_file, job_name, description
FROM jobs WHERE job_id = ?
''', (job_id,))
job_info = cursor.fetchone()
conn.close()
if not job_info:
print(f"Job {job_id} not found")
return
config_file, job_name, description = job_info
        # Copy configuration to the shared area
        shared_config = f"/shared/configs/shared_{job_name}_{job_id}.yaml"
        shutil.copy(config_file, shared_config)
# Update job as shared
conn = sqlite3.connect(self.db_path)
conn.execute('''
UPDATE jobs SET shared = 1 WHERE job_id = ?
''', (job_id,))
conn.commit()
conn.close()
print(f"Shared configuration: {shared_config}")
return shared_config
# Usage example
def main():
manager = CollaborativeJobManager()
# Register a new job
job_id = "123456"
manager.register_job(
job_id=job_id,
user="researcher1",
project="protein_folding",
job_name="md_simulation",
config_file="configs/md_config.yaml",
description="Molecular dynamics simulation of protein X",
shared=True
)
# Get project jobs
jobs = manager.get_project_jobs("protein_folding")
print(f"Active jobs in protein_folding project: {len(jobs)}")
# Share job configuration
manager.share_job_config(job_id)
if __name__ == "__main__":
main()
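Because the tracker is a plain SQLite file, collaborators can also inspect it directly with the sqlite3 command-line tool, for example:

# List the ten most recently shared job configurations
sqlite3 /shared/job_tracking.db \
    "SELECT job_id, user, job_name, submit_time FROM jobs
     WHERE shared = 1 ORDER BY submit_time DESC LIMIT 10;"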
Knowledge Sharing and Documentation
Submission Patterns Library
Create a library of common submission patterns:
# Job Submission Patterns Library

## Pattern 1: CPU-Intensive Analysis

**Use Case**: Large-scale data analysis requiring significant CPU resources

**Configuration**:
```yaml
resources:
  cpus: 32-64
  memory: "64G-128G"
  time: "8-24 hours"
  partition: "cpu"
```

**Best For**:
- Statistical analysis of large datasets
- Bioinformatics sequence analysis
- Mathematical modeling

**Example Usage**:
```bash
sbatch --cpus-per-task=64 --mem=128G --time=24:00:00 cpu_analysis.sh
```

## Pattern 2: GPU Machine Learning

**Use Case**: Deep learning model training and inference

**Configuration**:
```yaml
resources:
  cpus: 16-32
  memory: "128G-256G"
  gpus: "v100:2-4"   # or "a100:1-2"
  time: "12-48 hours"
  partition: "gpu"
```

**Best For**:
- Neural network training
- Computer vision tasks
- Natural language processing

**Example Usage**:
```bash
sbatch --gres=gpu:a100:2 --cpus-per-task=32 --mem=256G ml_training.sh
```

## Pattern 3: Memory-Intensive Processing

**Use Case**: Large dataset processing requiring significant memory

**Configuration**:
```yaml
resources:
  cpus: 16-32
  memory: "256G-1TB"
  time: "4-12 hours"
  partition: "highmem"
```

**Best For**:
- Genome assembly
- Large matrix operations
- In-memory database operations

## Pattern 4: Array Jobs for Parameter Sweeps

**Use Case**: Running multiple similar jobs with different parameters

**Configuration**:
```yaml
array: "1-100"
resources:
  cpus: 4-8
  memory: "16G-32G"
  time: "2-4 hours"
```

**Best For**:
- Hyperparameter optimization
- Monte Carlo simulations
- Batch file processing
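To complement Pattern 4, here is a minimal array-job sketch; the worker script process_case.sh and the params.txt parameter file are hypothetical placeholders:

#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --array=1-100
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=logs/sweep_%A_%a.out

# Each array task reads one line of parameters and processes it
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
./process_case.sh $PARAMS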
Best Practices Documentation
Document submission best practices:
```markdown
# Job Submission Best Practices
## Resource Estimation
### CPU Requirements
- **Single-threaded**: 1 CPU
- **Multi-threaded**: Match CPU count to thread count
- **Parallel processing**: Consider memory per CPU ratio
### Memory Guidelines
- **Rule of thumb**: 4-8GB per CPU for general computing
- **Memory-intensive**: 16-32GB per CPU
- **Large datasets**: Size of dataset + processing overhead
### Time Limits
- **Start conservative**: Begin with shorter time limits
- **Monitor usage**: Use `sacct` to track actual runtime
- **Add buffer**: Include 20-30% buffer for variability
## Queue Selection
### Development Queue
- **Purpose**: Testing and debugging
- **Limits**: Short time limits, limited resources
- **Usage**: Small test jobs only
### Standard Queue
- **Purpose**: Regular production jobs
- **Limits**: Moderate time and resource limits
- **Usage**: Most computational work
### Priority Queue
- **Purpose**: Urgent or time-sensitive work
- **Limits**: Higher resource allocation
- **Usage**: Critical deadlines only
## Error Prevention
### Common Mistakes
1. **Under-estimating resources**: Leading to job failures
2. **Over-requesting resources**: Inefficient queue times
3. **Missing dependencies**: Module loading issues
4. **Path problems**: Incorrect file paths
### Validation Checklist
- [ ] Input files exist and are readable
- [ ] Output directories have write permissions
- [ ] Required modules are available
- [ ] Resource requests are reasonable
- [ ] Time limits are sufficient
## Optimization Strategies
### I/O Optimization
- Use local scratch space for intensive I/O
- Minimize data movement during execution
- Compress output files when appropriate
### Resource Efficiency
- Monitor actual usage vs. requested resources
- Adjust requests based on historical data
- Use job arrays for similar tasks
### Troubleshooting
- Check job logs for error messages
- Validate environment setup
- Test with smaller datasets first
```
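To act on the "monitor usage" advice above, compare requested versus consumed resources once a job finishes; seff is a common companion tool, but its availability depends on the site:

# Compare requested vs. actual usage for a completed job (ID is illustrative)
sacct -j 123456 --format=JobID,Elapsed,Timelimit,MaxRSS,ReqMem,AllocCPUS
# Per-job efficiency summary, where installed
seff 123456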
Team Collaboration Tools
Shared Monitoring Dashboard
Create collaborative monitoring tools:
#!/usr/bin/env python3
"""
Team job monitoring dashboard
"""
import subprocess
import json
from datetime import datetime
from collections import defaultdict
class TeamJobDashboard:
def __init__(self, team_users):
self.team_users = team_users
def get_team_jobs(self):
"""Get all jobs for team members"""
all_jobs = []
for user in self.team_users:
try:
                # Get jobs for this user (squeue --json needs a recent Slurm release)
result = subprocess.run(
['squeue', '-u', user, '--json'],
capture_output=True, text=True, check=True
)
user_jobs = json.loads(result.stdout)
for job in user_jobs.get('jobs', []):
job['user'] = user
all_jobs.append(job)
except subprocess.CalledProcessError:
print(f"Failed to get jobs for user {user}")
return all_jobs
def generate_team_report(self):
"""Generate team job status report"""
jobs = self.get_team_jobs()
# Organize by user and status
user_stats = defaultdict(lambda: defaultdict(int))
queue_usage = defaultdict(int)
total_resources = {'cpus': 0, 'memory': 0}
        for job in jobs:
            user = job['user']
            # Field layouts vary across Slurm releases: job_state may be a
            # string or a single-element list, and cpus may be wrapped in a dict
            state = job['job_state']
            if isinstance(state, list):
                state = state[0]
            partition = job['partition']
            user_stats[user][state] += 1
            queue_usage[partition] += 1
            if state == 'RUNNING':
                cpus = job.get('cpus', 0)
                if isinstance(cpus, dict):
                    cpus = cpus.get('number', 0)
                total_resources['cpus'] += cpus
                # Memory accounting would need additional field parsing
# Generate report
report = f"Team Job Dashboard - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n"
report += "=" * 60 + "\n\n"
# User summary
report += "User Summary:\n"
for user in self.team_users:
stats = user_stats[user]
total_jobs = sum(stats.values())
            if total_jobs > 0:
                # Build the running/pending detail without unbalanced parentheses
                parts = []
                if stats['RUNNING']:
                    parts.append(f"{stats['RUNNING']} running")
                if stats['PENDING']:
                    parts.append(f"{stats['PENDING']} pending")
                detail = f" ({', '.join(parts)})" if parts else ""
                report += f"  {user:<15}: {total_jobs:3d} total{detail}\n"
# Queue usage
report += "\nQueue Usage:\n"
for queue, count in sorted(queue_usage.items()):
report += f" {queue:<15}: {count} jobs\n"
# Resource utilization
report += f"\nTotal Team Resources:\n"
report += f" CPUs in use: {total_resources['cpus']}\n"
return report
def save_report(self, filename="team_dashboard.txt"):
"""Save report to file"""
report = self.generate_team_report()
with open(filename, 'w') as f:
f.write(report)
return report
# Usage
team_members = ['researcher1', 'researcher2', 'researcher3', 'postdoc1']
dashboard = TeamJobDashboard(team_members)
print(dashboard.generate_team_report())
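To keep the report visible to the whole team, the dashboard can run from cron; the script and report paths below are hypothetical:

# Refresh the shared team report every 15 minutes
*/15 * * * * python3 /shared/scripts/team_dashboard.py > /shared/reports/team_dashboard.txt 2>&1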
Communication Integration
Integrate with team communication tools:
#!/bin/bash
# Job notification script for team collaboration
notify_team() {
local job_id="$1"
local user="$2"
local status="$3"
local job_name="$4"
    # Get job details (completed jobs may already be purged from scontrol)
    job_info=$(scontrol show job "$job_id" 2>/dev/null || echo "details no longer available")
# Format notification message
message="🔬 Job Update: $job_name ($job_id)
User: $user
Status: $status
Time: $(date)
Job Details:
$job_info"
    # Send to team chat (example with Slack webhook); jq JSON-escapes the
    # multi-line message so the payload stays valid JSON
    if [[ -n "$SLACK_WEBHOOK_URL" ]]; then
        payload=$(jq -n --arg text "$message" '{text: $text}')
        curl -X POST -H 'Content-type: application/json' \
            --data "$payload" \
            "$SLACK_WEBHOOK_URL"
    fi
# Send email to team
if [[ -n "$TEAM_EMAIL_LIST" ]]; then
echo "$message" | mail -s "Job Update: $job_name" "$TEAM_EMAIL_LIST"
fi
# Log to shared activity feed
echo "$(date): $user - $job_name ($job_id) - $status" >> \
/shared/team_activity.log
}
# Monitor jobs and send notifications
monitor_team_jobs() {
local team_users=("$@")
local last_check_file="/tmp/last_job_check"
# Get timestamp of last check
if [[ -f "$last_check_file" ]]; then
last_check=$(cat "$last_check_file")
else
last_check=$(date -d "1 hour ago" +%s)
fi
# Check for job state changes since last check
    for user in "${team_users[@]}"; do
        # Get recent job completions (sacct needs an ISO-style start time;
        # -X limits output to allocations, --noheader drops the header row)
        sacct -u "$user" --starttime="$(date -d "@$last_check" +%Y-%m-%dT%H:%M:%S)" \
            --format=JobID,JobName,User,State,End \
            --state=COMPLETED,FAILED,TIMEOUT --noheader -X | \
        while read -r job_id job_name user state end_time; do
            if [[ -n "$job_id" && -n "$end_time" ]]; then
                notify_team "$job_id" "$user" "$state" "$job_name"
            fi
        done
done
# Update last check timestamp
date +%s > "$last_check_file"
}
# Usage (run periodically via cron)
# */15 * * * * /path/to/monitor_team_jobs.sh researcher1 researcher2 researcher3
Submission Templates and Examples
Template Repository
Organize submission templates systematically:
# Create template repository structure
mkdir -p /shared/templates/{basic,gpu,array,workflow}
# Basic CPU job template
cat > /shared/templates/basic/cpu_job.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=cpu_analysis
#SBATCH --account=__ACCOUNT__
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=__CPUS__
#SBATCH --mem=__MEMORY__
#SBATCH --time=__TIME__
#SBATCH --output=logs/cpu_%j.out
#SBATCH --error=logs/cpu_%j.err
# Template variables to replace:
# __ACCOUNT__ - your account/project
# __CPUS__ - number of CPU cores
# __MEMORY__ - memory allocation (e.g., 32G)
# __TIME__ - time limit (e.g., 08:00:00)
# Create logs directory
mkdir -p logs
# Load modules
module load python/3.9
# Your commands here
echo "Starting CPU job at $(date)"
echo "Job ID: $SLURM_JOB_ID"
echo "CPUs: $SLURM_CPUS_PER_TASK"
# Replace with your actual commands
python your_script.py
echo "Job completed at $(date)"
EOF
# GPU job template
cat > /shared/templates/gpu/gpu_job.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=gpu_training
#SBATCH --account=__ACCOUNT__
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=__CPUS__
#SBATCH --mem=__MEMORY__
#SBATCH --gres=gpu:__GPU_TYPE__:__GPU_COUNT__
#SBATCH --time=__TIME__
#SBATCH --output=logs/gpu_%j.out
#SBATCH --error=logs/gpu_%j.err
# Template variables:
# __GPU_TYPE__ - gpu type (v100, a100, etc.)
# __GPU_COUNT__ - number of GPUs
mkdir -p logs
# Load GPU modules
module load python/3.9
module load cuda/11.8
module load cudnn/8.6
# Note: Slurm sets CUDA_VISIBLE_DEVICES to the allocated GPUs automatically
echo "Starting GPU job at $(date)"
echo "Job ID: $SLURM_JOB_ID"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
# Your GPU commands here
python your_gpu_script.py
echo "Job completed at $(date)"
EOF
# Template instantiation script
cat > /shared/templates/instantiate_template.py << 'EOF'
#!/usr/bin/env python3
"""
Template instantiation utility
"""
import argparse
import re
from pathlib import Path
def instantiate_template(template_file, substitutions, output_file):
"""Replace template variables with actual values"""
with open(template_file, 'r') as f:
content = f.read()
# Perform substitutions
for placeholder, value in substitutions.items():
content = content.replace(f"__{placeholder}__", str(value))
    # Check for remaining placeholders anywhere in the script
    remaining = [line for line in content.splitlines()
                 if re.search(r'__[A-Z_]+__', line)]
    if remaining:
        print("Warning: unreplaced placeholders found:")
        for line in remaining:
            print(f"  {line.strip()}")
# Write instantiated template
with open(output_file, 'w') as f:
f.write(content)
# Make executable
Path(output_file).chmod(0o755)
print(f"Template instantiated: {output_file}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Instantiate job template')
parser.add_argument('template', help='Template file path')
parser.add_argument('output', help='Output file path')
parser.add_argument('--account', required=True, help='Account/project name')
parser.add_argument('--cpus', type=int, default=16, help='Number of CPUs')
parser.add_argument('--memory', default='64G', help='Memory allocation')
parser.add_argument('--time', default='08:00:00', help='Time limit')
parser.add_argument('--gpu-type', default='v100', help='GPU type')
parser.add_argument('--gpu-count', type=int, default=1, help='Number of GPUs')
args = parser.parse_args()
substitutions = {
'ACCOUNT': args.account,
'CPUS': args.cpus,
'MEMORY': args.memory,
'TIME': args.time,
'GPU_TYPE': args.gpu_type,
'GPU_COUNT': args.gpu_count
}
instantiate_template(args.template, substitutions, args.output)
EOF
chmod +x /shared/templates/instantiate_template.py
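For example, to produce a runnable CPU job from the template (the account name is illustrative):

python3 /shared/templates/instantiate_template.py \
    /shared/templates/basic/cpu_job.sh my_cpu_job.sh \
    --account proj123 --cpus 32 --memory 128G --time 12:00:00
sbatch my_cpu_job.sh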
Best Practices Summary
Sharing Strategy
- Use version control for submission scripts and configurations
- Create reusable templates with clear parameter substitution
- Document submission patterns and best practices
- Implement collaborative monitoring and notification systems
Knowledge Transfer
- Maintain a library of common submission patterns
- Provide examples for different use cases
- Document troubleshooting procedures and solutions
- Conduct regular knowledge sharing sessions
Collaboration Tools
- Implement shared job tracking and management systems
- Integrate with team communication platforms
- Create monitoring dashboards for team visibility
- Establish code review processes for shared submissions
Documentation
- Keep comprehensive documentation of submission procedures
- Maintain examples and templates for common scenarios
- Document organizational policies and best practices
- Provide troubleshooting guides and FAQ
Next Steps
After establishing submission sharing practices:
- Delete Job Submissions - Learn about cleanup and lifecycle management
- Develop organization-specific submission standards
- Create training materials for new team members
- Implement metrics and analytics for submission effectiveness
- Establish governance policies for shared resources