fal Compute is an Enterprise offering. To request access and learn more, please visit the dashboard to get started.
Prerequisites
Before you begin, make sure you have:
- A fal.ai account with Compute access
- An SSH key pair for secure instance access
- Basic familiarity with SSH and command line tools
Generate SSH Key (if needed)
If you don’t have an SSH key pair, generate one:
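For example, to create an ed25519 key pair (the file path and comment below are placeholders):

```bash
# Generate a new ed25519 key pair; the file path and comment are examples
ssh-keygen -t ed25519 -f ~/.ssh/fal_compute -C "you@example.com"

# Print the public key so it can be pasted into the dashboard later
cat ~/.ssh/fal_compute.pub
```

Step 1: Create Your Instance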
1. Access the Dashboard
- Navigate to the fal Compute Dashboard
- Click the “Create” button
2. Configure Your Instance
- Instance Type: Choose between:
  - 1xH100-SXM: Single GPU for development and smaller workloads
  - 8xH100-SXM: Eight GPUs for large-scale training and inference
- Sector Selection:
  - Default: For single-instance workloads
  - Specific Sector: For multi-node clusters with InfiniBand connectivity
- SSH Key: Paste your public SSH key for secure access
3. Launch Instance
- Review your configuration
- Click “Create” to provision your instance
- Wait for the instance to reach “ready” state (typically 2-3 minutes)
Step 2: Connect to Your Instance
Once your instance is running, you’ll receive connection details:
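The exact username, address, and key path are shown in the dashboard; a typical connection (the values below are placeholders) looks like this:

```bash
# Connect using the private key that matches the public key you provided
ssh -i ~/.ssh/fal_compute <user>@<instance-address>
```

Step 3: Verify Your Setup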
After connecting, check your GPU resources:
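The commands below are standard NVIDIA tools that should be available on the instance image:

```bash
# Show the GPUs, driver version, and current GPU utilization
nvidia-smi

# Confirm the CUDA toolkit is on the PATH (if your image includes it)
nvcc --version
```

Step 4: Install Your Dependencies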
Install your required software stack:
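What you install depends on your workload. As one common example, a PyTorch-based stack can be set up with pip (the packages below are only a suggestion, not a required stack):

```bash
# Example only: install a typical PyTorch training/inference stack
pip install --upgrade pip
pip install torch torchvision transformers accelerate
```

Step 5: Run Your First Workload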
Test your setup with a simple GPU workload, for example a short script saved as test_gpu.py.
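A minimal sketch of such a script, assuming the PyTorch install from the previous step, could look like this:

```python
# test_gpu.py -- minimal GPU smoke test (assumes PyTorch is installed)
import torch

def main():
    # Report whether CUDA is visible and which GPUs are available
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

    # Run a small matrix multiplication on the GPU as a quick sanity check
    if torch.cuda.is_available():
        x = torch.randn(4096, 4096, device="cuda")
        y = x @ x
        torch.cuda.synchronize()
        print(f"Matmul OK, result norm: {y.norm().item():.2f}")

if __name__ == "__main__":
    main()
```

Run it with `python test_gpu.py`; you should see your GPUs listed and the matrix multiplication complete without errors.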
Step 6: Transfer Your Data
For training workloads, you’ll need to transfer your datasets:
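Any standard transfer tool works over SSH; for example, with rsync (the user, address, and paths below are placeholders):

```bash
# Copy a local dataset directory to the instance over SSH
rsync -avP -e "ssh -i ~/.ssh/fal_compute" ./datasets/ <user>@<instance-address>:~/datasets/
```

Next Steps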
Now that your instance is running, you can:

For Machine Learning
- Training: Start your training scripts with dedicated GPU resources
- Fine-tuning: Adapt pre-trained models with your custom datasets
- Inference: Deploy models for batch or real-time inference
For Multi-GPU Workloads (8xH100)
- Distributed Training: Use frameworks like DeepSpeed, Horovod, or PyTorch DDP (see the launch example after this list)
- Model Parallelism: Split large models across multiple GPUs
- Data Parallelism: Process multiple batches simultaneously
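As a sketch of the PyTorch DDP option, torchrun can launch one process per GPU on a single 8xH100 instance (train.py is a placeholder for your own training script):

```bash
# Launch a DDP training script across all 8 local GPUs
torchrun --standalone --nproc_per_node=8 train.py
```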
For Multi-Node Clusters
- InfiniBand Setup: Configure high-speed inter-node communication (see the check after this list)
- Cluster Management: Use tools like SLURM or Kubernetes for job scheduling
- Distributed Computing: Scale workloads across multiple instances
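As a starting point for the InfiniBand setup, you can check the adapters and point NCCL at them; the HCA name below is an assumption, so confirm it on your own instance:

```bash
# Verify the InfiniBand adapters are active (from the infiniband-diags package)
ibstat

# Example NCCL settings for multi-node training; adjust the HCA name as needed
export NCCL_IB_HCA=mlx5
export NCCL_DEBUG=INFO
```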
Managing Your Instance
Troubleshooting
Common Issues
SSH Connection Failed
- Verify your SSH key is correctly configured (a verbose connection check is shown after this list)
- Check instance status in the dashboard
- Ensure your IP is not blocked by firewalls
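A verbose SSH attempt usually shows which key is being offered and where the connection fails (the values below are placeholders):

```bash
# Re-run the connection with verbose output to see where it fails
ssh -v -i ~/.ssh/fal_compute <user>@<instance-address>
```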
GPU Not Detected
- Run `nvidia-smi` to check GPU status
- Verify CUDA installation with `nvcc --version`
- Restart the instance if GPU drivers aren’t loaded

Out of Memory Errors
- Monitor GPU memory with `nvidia-smi`
- Reduce batch sizes in your training scripts
- Use gradient checkpointing to save memory
Getting Help
- Check the fal.ai documentation for detailed guides
- Contact support through the dashboard for technical issues
- Join the community forums for user discussions