Enterprise: To request access and learn more about fal Compute, please visit the dashboard to get started.

Prerequisites

Before you begin, make sure you have:
  • A fal.ai account with Compute access
  • An SSH key pair for secure instance access
  • Basic familiarity with SSH and command line tools

Generate SSH Key (if needed)

If you don’t have an SSH key pair, generate one:
# Generate a new SSH key pair
ssh-keygen -t rsa -b 4096 -C "[email protected]"

# Display your public key (you'll need this for instance creation)
cat ~/.ssh/id_rsa.pub
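
Alternatively, Ed25519 keys are shorter and widely supported by modern OpenSSH:
# Generate an Ed25519 key pair instead (optional alternative to RSA)
ssh-keygen -t ed25519 -C "[email protected]"

# Its public key lives at ~/.ssh/id_ed25519.pub
cat ~/.ssh/id_ed25519.pub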

Step 1: Create Your Instance

1. Access the Dashboard

Log in to the fal.ai dashboard and open the Compute section.

2. Configure Your Instance

  • Instance Type: Choose between:
    • 1xH100-SXM: Single GPU for development and smaller workloads
    • 8xH100-SXM: Eight GPUs for large-scale training and inference
  • Sector Selection:
    • Default: For single-instance workloads
    • Specific Sector: For multi-node clusters with InfiniBand connectivity
  • SSH Key: Paste your public SSH key for secure access

3. Launch Instance

  • Review your configuration
  • Click “Create” to provision your instance
  • Wait for the instance to reach “ready” state (typically 2-3 minutes)

Step 2: Connect to Your Instance

Once your instance is running, you’ll receive connection details:
# Connect via SSH (replace with your actual connection details)
ssh ubuntu@your-instance-ip

# Example connection
ssh [email protected]
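
Optionally, add a host alias to ~/.ssh/config so you don't retype the details each time (the alias name, IP, and key path below are placeholders):
# ~/.ssh/config
Host fal-compute
    HostName your-instance-ip
    User ubuntu
    IdentityFile ~/.ssh/id_rsa

# Connect using the alias
ssh fal-compute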

Step 3: Verify Your Setup

After connecting, check your GPU resources:
# Check GPU availability
nvidia-smi

# Verify CUDA installation
nvcc --version

# Check storage
df -h

# View system resources
htop
Expected output for 1xH100-SXM:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.xx.xx    Driver Version: 525.xx.xx    CUDA Version: 12.x   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA H100-SXM...  Off  | 00000000:01:00.0 Off |                    0 |
| N/A   27C    P0    68W / 700W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

Step 4: Install Your Dependencies

Install your required software stack:
# Update system packages
sudo apt update && sudo apt upgrade -y

# Install Python and pip (if not already installed)
sudo apt install python3 python3-pip -y

# Install common ML libraries
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify PyTorch can see your GPU
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU count: {torch.cuda.device_count()}')"
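
If you prefer to keep these packages isolated from the system Python, a standard virtual environment works as usual (the path ~/venv is just an example):
# Optional: install into a virtual environment instead
sudo apt install python3-venv -y
python3 -m venv ~/venv
source ~/venv/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121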

Step 5: Run Your First Workload

Test your setup with a simple GPU workload:
test_gpu.py
import torch
import time

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    # Create large tensors on GPU
    device = torch.device('cuda')

    # Simple matrix multiplication test
    print("Running GPU compute test...")
    start_time = time.time()

    a = torch.randn(10000, 10000, device=device)
    b = torch.randn(10000, 10000, device=device)
    c = torch.matmul(a, b)
    # CUDA kernels run asynchronously; wait for the GPU to finish
    # so the timing below is accurate
    torch.cuda.synchronize()

    end_time = time.time()
    print(f"Matrix multiplication completed in {end_time - start_time:.2f} seconds")
    print(f"GPU memory used: {torch.cuda.memory_allocated(device) / 1024**3:.2f} GB")
Run the test:
python3 test_gpu.py

Step 6: Transfer Your Data

For training workloads, you’ll need to transfer your datasets:
# Using scp to transfer files
scp -r /local/path/to/dataset ubuntu@your-instance-ip:/remote/path/

# Using rsync for large datasets (compresses in transit, resumable, shows progress)
rsync -avz -P /local/path/to/dataset/ ubuntu@your-instance-ip:/remote/path/dataset/

# Or download directly on the instance
wget https://example.com/dataset.tar.gz
tar -xzf dataset.tar.gz
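
If the source publishes a checksum, it's worth verifying the archive before extracting (dataset.tar.gz is the placeholder from above):
# Compare against the published SHA-256 checksum
sha256sum dataset.tar.gz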

Next Steps

Now that your instance is running, you can:

For Machine Learning

  • Training: Start your training scripts with dedicated GPU resources
  • Fine-tuning: Adapt pre-trained models with your custom datasets
  • Inference: Deploy models for batch or real-time inference

For Multi-GPU Workloads (8xH100)

  • Distributed Training: Use frameworks like DeepSpeed, Horovod, or PyTorch DDP (a minimal DDP sketch follows this list)
  • Model Parallelism: Split large models across multiple GPUs
  • Data Parallelism: Process multiple batches simultaneously
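
As a rough starting point, here is a minimal PyTorch DDP script for a single 8-GPU instance; the toy model and the filename ddp_test.py are illustrative placeholders, not fal-specific:
ddp_test.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model; replace with your real training setup
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        x = torch.randn(64, 1024, device=local_rank)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()  # gradients are averaged across all GPUs
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launch one process per GPU with torchrun:
torchrun --nproc_per_node=8 ddp_test.py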

For Multi-Node Clusters

  • InfiniBand Setup: Configure high-speed inter-node communication
  • Cluster Management: Use tools like SLURM or Kubernetes for job scheduling
  • Distributed Computing: Scale workloads across multiple instances (see the launch example below)
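
For example, the ddp_test.py sketch above could be launched across two 8-GPU instances with torchrun; the head node address and port are placeholders:
# Run this on every node; processes rendezvous at the head node
torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=head-node-ip:29500 ddp_test.py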

Managing Your Instance

# Monitor GPU usage
watch -n 1 nvidia-smi

# Check disk usage
df -h

# Monitor system resources
htop

# Check network connectivity (for multi-node)
ibstatus  # InfiniBand status
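
For logging usage over time, nvidia-smi's query mode prints machine-readable samples:
# Sample GPU utilization and memory every 5 seconds, in CSV form
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 5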

Troubleshooting

Common Issues

SSH Connection Failed
  • Verify your SSH key is correctly configured
  • Check instance status in the dashboard
  • Ensure your IP is not blocked by firewalls (see the verbose check below)
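Running SSH in verbose mode shows which step of the connection fails:
ssh -v ubuntu@your-instance-ip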
GPU Not Detected
  • Run nvidia-smi to check GPU status
  • Verify CUDA installation with nvcc --version
  • Restart the instance if GPU drivers aren’t loaded
Out of Memory Errors
  • Monitor GPU memory with nvidia-smi
  • Reduce batch sizes in your training scripts
  • Use gradient checkpointing to save memory (see the sketch below)
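As a rough illustration, torch.utils.checkpoint trades compute for memory by recomputing activations during the backward pass; the layer sizes here are arbitrary:
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose activations would otherwise all stay in GPU memory
model = torch.nn.Sequential(
    *[torch.nn.Linear(4096, 4096) for _ in range(16)]
).cuda()

x = torch.randn(32, 4096, device="cuda", requires_grad=True)

# Split into 4 segments: only segment boundaries keep activations;
# the rest are recomputed during backward, cutting peak memory
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()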

Getting Help

  • Check the fal.ai documentation for detailed guides
  • Contact support through the dashboard for technical issues
  • Join the community forums for user discussions