Agent Skills

hpc-cluster (@fl-sean03/hpc-cluster)

fl-sean03 · 0 forks · Updated 4/6/2026

Run jobs on CU Boulder CURC HPC cluster (Alpine). Use when simulations need more compute than the local workstation, for large-scale parallel jobs, or when GPU resources are needed beyond local availability. You have full SSH access - work like a researcher.

Installation

$ npx agent-skills-cli install @fl-sean03/hpc-cluster

Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity

Details

Path: .claude/skills/hpc-cluster/SKILL.md
Branch: main
Scoped Name: @fl-sean03/hpc-cluster

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: hpc-cluster
description: Run jobs on CU Boulder CURC HPC cluster (Alpine). Use when simulations need more compute than the local workstation, for large-scale parallel jobs, or when GPU resources are needed beyond local availability. You have full SSH access - work like a researcher.
allowed-tools:

  • Read
  • Write
  • Edit
  • Bash
  • Glob
  • Grep
  • WebSearch
  • WebFetch

CURC HPC Cluster Access (CU Boulder Alpine)

You have full SSH access to CU Boulder's Alpine HPC cluster. You can do everything a human researcher can do: submit jobs, debug failures, load modules, transfer files, and work autonomously.

Quick Reference

| Item | Value |
| --- | --- |
| Login | ssh $CURC_USER@login.rc.colorado.edu |
| Filesystem | /scratch/alpine/$CURC_USER/ (10 TB, fast I/O) |
| Agent Workspace | /scratch/alpine/$CURC_USER/Agent_Runs/ |
| Job Scheduler | SLURM |
| Default Partitions | amilan (CPU), aa100 (GPU) |
| Authentication | SSH key (pre-configured) |
| HPC Client | .claude/skills/hpc-cluster/hpc_client.py |

Two Ways to Work

You have two approaches available:

1. Python HPC Client (Recommended for common operations)

A lightweight client that handles connection management and common patterns:

import sys
import os
# Add the skill directory to path (relative to project root)
skill_dir = os.path.join(os.environ.get('PROJECT_ROOT', '.'), '.claude/skills/hpc-cluster')
sys.path.insert(0, skill_dir)
from hpc_client import HPCClient

hpc = HPCClient()
hpc.connect()

# Create workspace, upload files, submit job, wait for completion
run_dir = hpc.create_run("argon-diffusion")
hpc.upload("input.lmp", f"{run_dir}/input.lmp")
hpc.upload("job.slurm", f"{run_dir}/job.slurm")
job_id = hpc.submit(f"{run_dir}/job.slurm")
status = hpc.wait_for_job(job_id, timeout=3600)

if status.is_success:
    hpc.download(f"{run_dir}/output.dat", "./results/")
else:
    # Debug: read error output
    print(hpc.read_file(f"{run_dir}/my_job_{job_id}.err"))

hpc.disconnect()

2. Direct SSH (For full control)

When you need to do something the client doesn't support, use raw SSH:

# Run any command
ssh $CURC_USER@login.rc.colorado.edu "your command here"

# Interactive debugging
ssh $CURC_USER@login.rc.colorado.edu

Use the client for: workspace setup, file transfer, job submission, job monitoring.
Use raw SSH for: debugging, exploring, unusual operations, anything not covered.


Connection

SSH Access

SSH is pre-configured with key-based authentication and connection multiplexing via ~/.ssh/config. Use the cu_alpine alias for simplicity:

# Connect to CURC login node (uses ~/.ssh/config)
ssh cu_alpine

# Run a single command
ssh cu_alpine "squeue -u $CURC_USER"

# Or use full address
ssh $CURC_USER@login.rc.colorado.edu "squeue -u $CURC_USER"

# Transfer files TO HPC
scp local_file.txt $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/

# Transfer files FROM HPC
scp $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/results.dat ./

Connection multiplexing: The SSH config uses ControlMaster to reuse connections - the first connection is slower, but subsequent ones are instant.
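As a sketch, the cu_alpine alias with multiplexing might look like the following in ~/.ssh/config (the alias matches the one used above; the socket path and persist time are assumptions, adapt to your actual config):

```
# ~/.ssh/config - sketch of the cu_alpine alias with connection multiplexing.
# Socket path and ControlPersist value are illustrative choices.
Host cu_alpine
    HostName login.rc.colorado.edu
    User your_curc_username
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 10m
```

ControlPersist keeps the master connection alive in the background after the first session closes, so repeated ssh/scp calls skip authentication entirely.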

Important: The login node is for submitting jobs and light tasks. Never run compute-intensive work directly on login nodes.


Workspace Structure

All agent work on HPC goes in the existing Agent_Runs directory:

/scratch/alpine/$CURC_USER/Agent_Runs/
├── argon-diffusion-20260118/
│   ├── inputs/
│   ├── outputs/
│   ├── job.slurm
│   └── README.md
├── water-tip4p-20260119/
├── shared/
│   ├── potentials/          # Downloaded force fields
│   ├── pseudopotentials/    # Downloaded pseudopotentials
│   └── scripts/             # Reusable analysis scripts
└── ...

Creating a New Run

# Create run directory with timestamp
RUN_NAME="project-name-$(date +%Y%m%d-%H%M%S)"
RUN_DIR="/scratch/alpine/$CURC_USER/Agent_Runs/$RUN_NAME"
ssh cu_alpine "mkdir -p $RUN_DIR/{inputs,outputs}"

SLURM Job Submission

Job Script Template

#!/bin/bash
#SBATCH --job-name=my_simulation
#SBATCH --partition=amilan          # CPU partition (or aa100 for GPU)
#SBATCH --nodes=1
#SBATCH --ntasks=32                 # Number of MPI tasks
#SBATCH --time=04:00:00             # Max runtime (HH:MM:SS)
#SBATCH --output=%x_%j.out          # stdout file
#SBATCH --error=%x_%j.err           # stderr file
#SBATCH --mail-type=END,FAIL        # Email notifications
#SBATCH --mail-user=your@email.com

# Load required modules
module purge
module load gcc/13.1.0
module load openmpi/4.1.6

# Change to run directory
cd $SLURM_SUBMIT_DIR

# Run your simulation
mpirun -np $SLURM_NTASKS ./your_program input.in

Key SLURM Commands

| Command | Purpose |
| --- | --- |
| sbatch job.slurm | Submit batch job |
| squeue -u $USER | Check your job status |
| squeue -j <jobid> | Check specific job |
| scancel <jobid> | Cancel a job |
| sinfo -p amilan | Check partition status |
| sacct -j <jobid> | Job accounting info |
| scontrol show job <jobid> | Detailed job info |

Job Status Codes

| Code | Meaning |
| --- | --- |
| PD | Pending (waiting for resources) |
| R | Running |
| CG | Completing |
| CD | Completed |
| F | Failed |
| TO | Timeout |
| CA | Cancelled |
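For scripting, the short codes can be translated with a small helper (a sketch; slurm_state_desc is an illustrative name, not a SLURM utility):

```shell
#!/bin/sh
# slurm_state_desc: map a SLURM short state code to a human-readable string.
# The function name is illustrative, not part of SLURM itself.
slurm_state_desc() {
    case "$1" in
        PD) echo "Pending (waiting for resources)" ;;
        R)  echo "Running" ;;
        CG) echo "Completing" ;;
        CD) echo "Completed" ;;
        F)  echo "Failed" ;;
        TO) echo "Timeout" ;;
        CA) echo "Cancelled" ;;
        *)  echo "Unknown state: $1" ;;
    esac
}

# Example: annotate a state code pulled from squeue output
slurm_state_desc PD
```

This is handy when summarizing many jobs at once, e.g. piping the state column of `squeue -h` through the helper.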

Available Partitions

Partition Selection Strategy

CRITICAL: Always validate on testing partition first before production runs!

Workflow:
1. atesting / atesting_a100  →  Validate job script works (1 hour max)
2. amilan / aa100            →  Production runs (24 hour max)
3. amilan + qos=long         →  Extended runs (7 day max, lower priority)

Testing Partitions (Use First!)

| Partition | Limits | Max Time | Purpose |
| --- | --- | --- | --- |
| atesting | 2 nodes, 16 cores max | 1h | Validate CPU jobs work before production |
| atesting_a100 | 1 GPU, 10 cores max | 1h | Validate GPU jobs work before production |
| atesting_mi100 | 1 GPU, 10 cores max | 1h | Validate AMD GPU jobs |

Always run a short test on atesting first to catch:

  • Module loading issues
  • Path errors
  • Input file problems
  • Memory requirements

Production CPU Partitions

| Partition | Nodes | Cores/Node | RAM/Node | Max Time | Use For |
| --- | --- | --- | --- | --- | --- |
| amilan | 387 | 32-64 | 256 GB (3.75 GB/core) | 24h | Default for production CPU jobs |
| amilan128c | 16 | 128 | 256 GB (2 GB/core) | 24h | High core count on single node (see below) |
| amem | 24 | 48-128 | up to 2 TB | 24h | Memory-intensive (requires --qos=mem, must request 256GB+) |

When to Use amilan128c vs amilan

Use amilan128c when:

  • Your job benefits from 128 cores on ONE node (vs spreading across multiple nodes)
  • Running OpenMP/shared-memory parallel codes
  • High inter-process communication (MPI with frequent small messages)
  • Tightly-coupled simulations where network latency hurts performance
  • Large LAMMPS/QE jobs that scale well but suffer from inter-node communication

Use regular amilan when:

  • Your job needs fewer than 64 cores
  • You need multiple nodes (amilan has 387 nodes vs only 16 for 128c)
  • Memory per core matters more (3.75 GB/core vs 2 GB/core on 128c)
  • Queue wait time is a concern (more nodes = shorter queue)

Example: 128-core single-node LAMMPS job

#SBATCH --partition=amilan128c
#SBATCH --nodes=1
#SBATCH --ntasks=128           # Use all 128 cores
#SBATCH --time=12:00:00

Production GPU Partitions

| Partition | Nodes | GPUs/Node | GPU Type | Max Time | Use For |
| --- | --- | --- | --- | --- | --- |
| aa100 | 11 | 3 | NVIDIA A100 (40GB) | 24h | Best for CUDA, ML/DL, GPU-accelerated MD |
| ami100 | 7 | 3 | AMD MI100 | 24h | ROCm/HIP workloads |
| al40 | 3 | 3 | NVIDIA L40 | 24h | Newer architecture, visualization |

Special Partitions

| Partition | Max Time | Purpose |
| --- | --- | --- |
| acompile | 12h | Compiling software only (use via acompile command) |
| csu | 24h | Colorado State contributed nodes |
| amc | 24h | CU Anschutz contributed nodes |

QoS (Quality of Service)

| QoS | Max Time | Priority | When to Use |
| --- | --- | --- | --- |
| normal | 24h | Normal | Default - use for most jobs |
| long | 7 days | Lower | Extended simulations (will wait longer in queue) |
| mem | 24h | Normal | Required for amem partition (high-memory jobs) |

Partition Selection Examples

# 1. TESTING: Always start here to validate your job works
#SBATCH --partition=atesting
#SBATCH --time=00:30:00
#SBATCH --ntasks=4

# 2. PRODUCTION CPU: After testing passes
#SBATCH --partition=amilan
#SBATCH --time=04:00:00
#SBATCH --ntasks=32

# 3. PRODUCTION GPU: For GPU-accelerated codes
#SBATCH --partition=aa100
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# 4. LONG RUNS: When 24h isn't enough (lower priority)
#SBATCH --partition=amilan
#SBATCH --qos=long
#SBATCH --time=168:00:00   # 7 days

# 5. HIGH MEMORY: For memory-intensive jobs (256GB+ required)
#SBATCH --partition=amem
#SBATCH --qos=mem
#SBATCH --mem=512G
#SBATCH --time=12:00:00

Module System

Software is managed through environment modules. Always work from a compute or compile node, not the login node.

Essential Commands

# List available modules
module avail

# Search for specific software
module spider lammps
module spider python

# Load modules
module load gcc/13.1.0
module load openmpi/4.1.6
module load lammps/20230802

# See what's loaded
module list

# Unload all modules
module purge

# Save/restore module sets
module save my_env
module restore my_env

Finding and Loading Software

Software on CURC is installed in /curc/sw/install/. To find what's available:

# List all installed software
ls /curc/sw/install/

# Check specific software versions
ls /curc/sw/install/lammps/    # LAMMPS versions (22July25, 2Sept25, etc.)
ls /curc/sw/install/QE/        # Quantum ESPRESSO (7.0, 7.2)
ls /curc/sw/install/gromacs/   # GROMACS versions

LAMMPS example (check exact paths for current versions):

# Find the binary
ls /curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin/

# In job script
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
mpirun -np $SLURM_NTASKS lmp -in input.lmp

Quantum ESPRESSO example:

module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/QE/7.2/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
mpirun -np $SLURM_NTASKS pw.x < input.in > output.out

Note: Module dependencies matter. Load compiler first, then MPI. Check exact version paths as they may change.
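Because install paths drift between versions, a defensive check before launching saves a wasted queue wait. A minimal sketch (BIN is a placeholder for lmp, pw.x, etc.; it defaults to sh only so the snippet runs anywhere):

```shell
#!/bin/sh
# Verify the expected binary is actually on PATH after module loads, and fail
# fast with a clear message instead of a cryptic mid-job error.
# BIN is illustrative; substitute lmp, pw.x, gmx_mpi, etc. in a real job script.
BIN="${BIN:-sh}"

if ! command -v "$BIN" >/dev/null 2>&1; then
    echo "ERROR: $BIN not found on PATH - check module loads and install paths" >&2
    exit 1
fi
echo "Using $BIN at $(command -v "$BIN")"
```

Drop this between the `export PATH=...` line and the `mpirun` line in any of the job script templates below.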


Storage Filesystem

Paths and Quotas

| Path | Quota | Purge | Use For |
| --- | --- | --- | --- |
| /home/$USER | 2 GB | Never | Scripts, small configs |
| /projects/$USER | 250 GB | Never | Code, small datasets |
| /scratch/alpine/$USER | 10 TB | 90 days | Job I/O, large files |
| $SLURM_SCRATCH | ~300 GB | Job end | Node-local temp storage |

Performance Rules

DO:

  • Run all job I/O on /scratch/alpine/
  • Use $SLURM_SCRATCH for intensive temporary files
  • Copy results back after job completes

DON'T:

  • Run I/O-intensive jobs on /home or /projects (will be killed)
  • Store important data only on /scratch (it's purged!)
  • Leave large files on login nodes
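The staging pattern behind these rules can be sketched in a job script (the mktemp fallback exists only so the sketch runs outside a SLURM job; file names are illustrative):

```shell
#!/bin/bash
# Sketch: stage I/O-heavy work through node-local scratch, then copy results back.
# $SLURM_SCRATCH is set by SLURM on compute nodes; the mktemp fallback lets the
# pattern be exercised outside a job.
WORK="${SLURM_SCRATCH:-$(mktemp -d)}"
RESULTS="${SLURM_SUBMIT_DIR:-$PWD}/outputs"
mkdir -p "$RESULTS"

# Stage inputs onto fast node-local storage
echo "example input" > "$WORK/input.dat"

# ... run the simulation against files in $WORK here ...
cp "$WORK/input.dat" "$WORK/output.dat"   # stand-in for real simulation output

# Copy results back before the job ends - node-local scratch is purged at job end
cp "$WORK"/output.dat "$RESULTS/"
echo "Results staged to $RESULTS"
```

The copy-back step is the part people forget: anything left in $SLURM_SCRATCH when the job exits is gone.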

Example Workflows

Recommended Workflow: Test First, Then Production

Step 1: Create a testing job script (job_test.slurm)

#!/bin/bash
#SBATCH --job-name=argon_test
#SBATCH --partition=atesting        # <-- TEST PARTITION FIRST
#SBATCH --nodes=1
#SBATCH --ntasks=4                  # Small scale for testing
#SBATCH --time=00:30:00             # 30 min is plenty for testing
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

echo "=== Testing job script ==="
echo "Started at: $(date)"
echo "Running on: $(hostname)"

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
echo "Working directory: $(pwd)"
echo "Input files: $(ls -la)"

# Run short test (reduce timesteps in input for testing)
mpirun -np $SLURM_NTASKS lmp -in input.lmp

echo "Finished at: $(date)"

Step 2: If test passes, create production job (job_prod.slurm)

#!/bin/bash
#SBATCH --job-name=argon_prod
#SBATCH --partition=amilan          # <-- PRODUCTION PARTITION
#SBATCH --nodes=1
#SBATCH --ntasks=32                 # Full scale
#SBATCH --time=04:00:00             # Appropriate for full run
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS lmp -in input.lmp

LAMMPS MD Simulation (Full Example)

#!/bin/bash
#SBATCH --job-name=argon_md
#SBATCH --partition=amilan
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS lmp -in input.lmp

Quantum ESPRESSO DFT

#!/bin/bash
#SBATCH --job-name=si_scf
#SBATCH --partition=amilan
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/QE/7.2/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS pw.x < si_scf.in > si_scf.out

GPU Job (Testing First)

Test on atesting_a100:

#!/bin/bash
#SBATCH --job-name=md_gpu_test
#SBATCH --partition=atesting_a100   # <-- GPU TESTING
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
#SBATCH --output=%x_%j.out

module purge
module load gcc/12.2.0 cuda/12.1.1
# Add LAMMPS GPU path here

cd $SLURM_SUBMIT_DIR
lmp -k on g 1 -sf kk -pk kokkos gpu/aware off -in input.lmp

Then production on aa100:

#!/bin/bash
#SBATCH --job-name=md_gpu_prod
#SBATCH --partition=aa100           # <-- GPU PRODUCTION
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:3                # Can use up to 3 GPUs per node
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out

module purge
module load gcc/12.2.0 cuda/12.1.1

cd $SLURM_SUBMIT_DIR
lmp -k on g 3 -sf kk -pk kokkos gpu/aware off -in input.lmp

Debugging Failed Jobs

When a job fails, investigate systematically:

1. Check Job Status

# See why it failed
sacct -j <jobid> --format=JobID,State,ExitCode,Reason

# Get detailed info
scontrol show job <jobid>

2. Read Output Files

# Check stdout
cat my_job_12345.out

# Check stderr (often has the real error)
cat my_job_12345.err

# Check application logs
cat log.lammps

3. Common Failure Reasons

| Issue | Symptom | Solution |
| --- | --- | --- |
| Timeout | State=TIMEOUT | Increase --time or optimize |
| Memory | State=OUT_OF_MEMORY | Increase nodes or use amem |
| Module not found | "command not found" | Check module load order |
| Bad path | "file not found" | Use absolute paths |
| Wrong partition | Job pending forever | Check partition resources |
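This triage can be scripted from sacct's parsable output. A sketch, using a hard-coded sample line in place of a live sacct call:

```shell
#!/bin/sh
# Sketch: triage a finished job from sacct output.
# SACCT_LINE stands in for: sacct -j "$JOB_ID" -n -P --format=JobID,State,ExitCode
SACCT_LINE="12345|TIMEOUT|0:0"    # illustrative sample, not real output

STATE=$(echo "$SACCT_LINE" | cut -d'|' -f2)
case "$STATE" in
    COMPLETED)     echo "OK - download results" ;;
    TIMEOUT)       echo "Hit walltime - increase --time or checkpoint more often" ;;
    OUT_OF_MEMORY) echo "OOM - request more memory or use amem" ;;
    FAILED)        echo "Check the .err file for the application error" ;;
    *)             echo "State $STATE - inspect with scontrol show job" ;;
esac
```

The same case statement maps directly onto the Common Failure Reasons table above.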

4. Interactive Debugging

# Get interactive session for debugging
sinteractive --partition=atesting --time=01:00:00 --ntasks=4

# Then run commands interactively to debug
module load lammps
lmp -in input.lmp  # See errors in real-time

File Transfer

Between Local and HPC

# Upload input files
# Upload input files
scp -r ./inputs/ $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/Agent_Runs/my-run/

# Download results
scp $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/Agent_Runs/my-run/output.dat ./

# Sync directories (rsync is more efficient for updates)
rsync -avz ./project/ $CURC_USER@login.rc.colorado.edu:/scratch/alpine/$CURC_USER/project/

Large File Transfers

For very large files, use Globus (web-based) or DTN nodes:

# Use data transfer node for large transfers
scp large_file.tar $CURC_USER@dtn.rc.colorado.edu:/scratch/alpine/$CURC_USER/

Queue Times and Async Job Management

Understanding Queue Wait Times

CRITICAL: HPC jobs don't start immediately. Queue times vary dramatically:

| Partition | Typical Wait | Why |
| --- | --- | --- |
| atesting | Minutes | Testing partition, low demand |
| amilan | Minutes to hours | Many nodes (387), high throughput |
| amilan128c | Hours to days | Only 16 nodes, high demand |
| aa100 | Hours to days | Only 11 nodes, GPU scarcity |

Before submitting, check the queue:

# See pending jobs and estimated start times
ssh cu_alpine "squeue -p amilan128c --start"

# Quick queue depth check (-h suppresses the header so wc -l counts only jobs)
ssh cu_alpine "squeue -h -p amilan128c --state=PENDING | wc -l"

Async Workflow (For Long Queue Times)

DON'T block waiting for jobs with multi-day queues. Instead:

from hpc_client import HPCClient

hpc = HPCClient()
hpc.connect()

# 1. Check queue before choosing partition
status = hpc.get_queue_status('amilan128c')
print(f"Estimated wait: {status['estimated_wait']}")
print(f"Pending jobs: {status['pending_jobs']}")

# 2. Compare partitions to choose wisely
for part in hpc.compare_partitions(['amilan', 'amilan128c', 'aa100']):
    print(f"{part['partition']}: {part['estimated_wait']}, {part['pending_jobs']} pending")

# 3. Submit async (returns immediately, saves tracking file)
tracking = hpc.submit_async(f"{run_dir}/job.slurm")
print(f"Job {tracking['job_id']} submitted")
print(f"Estimated start: {tracking['estimated_start']}")
# Returns immediately - don't wait!

# 4. Later: Check on all submitted jobs
jobs = hpc.check_async_jobs()
for job in jobs:
    print(f"Job {job['job_id']}: {job['current_status']}")
    if job['is_finished']:
        print(f"  Completed! Success: {job['is_success']}")

Workflow Strategy for Long-Running Studies

For multi-day queue scenarios:

Day 1: Submit jobs
├── Check queue status
├── Submit with submit_async()
├── Note estimated start times
└── Move on to other work

Day 2+: Check periodically
├── hpc.check_async_jobs()
├── If still PENDING: wait
├── If RUNNING: monitor progress
└── If COMPLETED: download results and analyze

SLURM Email Notifications (Recommended)

Add to your job scripts for automatic notifications:

#SBATCH --mail-type=BEGIN,END,FAIL    # When to email
#SBATCH --mail-user=your@email.com    # Your email

# Options: NONE, BEGIN, END, FAIL, REQUEUE, ALL
# BEGIN = job started (left queue)
# END = job finished
# FAIL = job failed

Smart Partition Selection

Decision tree:

Need GPU?
├── YES → Check aa100 queue
│         └── Long wait? Consider if job can run on CPU instead
└── NO → How many cores?
         ├── ≤64 cores → amilan (shorter queue, more nodes)
         └── >64 cores or tightly-coupled →
             └── Check amilan128c queue
                 └── Wait >24h? Consider splitting across amilan nodes

Check Job Progress

# One-time status check with start time estimates
ssh cu_alpine "squeue -u $CURC_USER --start"

# See job details
ssh cu_alpine "scontrol show job <jobid>"

# Check why job is pending
ssh cu_alpine "squeue -j <jobid> --format='%r'"  # Shows REASON

Wait for Job Completion (Short Jobs Only)

Only use blocking wait for jobs expected to complete within minutes:

# Poll until job completes (ONLY for short jobs!)
JOB_ID=12345
while ssh cu_alpine "squeue -j $JOB_ID 2>/dev/null | grep -q $JOB_ID"; do
    echo "Job $JOB_ID still running..."
    sleep 60
done
echo "Job $JOB_ID completed"

# Check final status
ssh cu_alpine "sacct -j $JOB_ID --format=JobID,State,ExitCode"

Key Principles

You Are a Researcher

You have the same access a human researcher has. You can:

  • Create any job script you need
  • Load any available module
  • Debug failures by reading logs
  • Adapt to different software versions
  • Figure out problems through investigation

Don't Just Execute - Verify

After running on HPC:

  1. Check job completed successfully (not just submitted)
  2. Verify output files exist and have content
  3. Check for error messages in stderr
  4. Validate results are physically reasonable
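Steps 1-3 above can be scripted. A sketch (file names are placeholders, and stand-in files are created only so the snippet runs outside a real job):

```shell
#!/bin/sh
# Sketch: post-job verification of outputs and stderr.
cd "$(mktemp -d)" || exit 1
OUT="output.dat"           # placeholder for the job's output file
ERR="job.err"              # placeholder for the job's stderr file
echo "data" > "$OUT"       # stand-in so the sketch is runnable
: > "$ERR"                 # stand-in empty stderr

# Steps 1-2: output exists and is non-empty
if [ -s "$OUT" ]; then echo "output OK"; else echo "MISSING or EMPTY: $OUT"; fi

# Step 3: stderr free of obvious error markers
if grep -qiE "error|segfault|traceback" "$ERR"; then
    echo "errors found in $ERR - investigate before trusting results"
else
    echo "stderr clean"
fi
```

Step 4 (physical reasonableness) stays a judgment call: sanity-check energies, temperatures, or convergence against expected ranges for the system.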

Document Your Work

Leave breadcrumbs for yourself:

# In job script
echo "Job started at $(date)"
echo "Running on $(hostname)"
echo "Loaded modules: $(module list 2>&1)"

Reference Links