# RTC Profiling Guide

This guide explains how to profile RTC (Real-Time Chunking) performance to identify bottlenecks and understand why RTC might be slower than expected.

## Quick Start

### 1. Profile with Real Robot (Profiled Version)

Use `eval_with_real_robot_profiled.py` to profile actual robot execution:

```bash
# With RTC enabled
uv run examples/rtc/eval_with_real_robot_profiled.py \
    --policy.path=helper2424/pi05_check_rtc \
    --policy.device=mps \
    --rtc.enabled=true \
    --rtc.execution_horizon=20 \
    --robot.type=so100_follower \
    --robot.port=/dev/tty.usbmodem58FA0834591 \
    --robot.id=so100_follower \
    --robot.cameras="{ gripper: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 30}}" \
    --task="Move green small object into the purple platform" \
    --duration=30

# Without RTC for comparison
uv run examples/rtc/eval_with_real_robot_profiled.py \
    --policy.path=helper2424/pi05_check_rtc \
    --policy.device=mps \
    --rtc.enabled=false \
    --robot.type=so100_follower \
    --robot.port=/dev/tty.usbmodem58FA0834591 \
    --robot.id=so100_follower \
    --robot.cameras="{ gripper: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 30}}" \
    --task="Move green small object into the purple platform" \
    --duration=30
```

**Output**: At the end of execution, you'll see a detailed breakdown of timing for each component:

- `get_actions.policy_inference` - Time spent in policy inference
- `get_actions.preprocessing` - Time spent preprocessing observations
- `get_actions.postprocessing` - Time spent postprocessing actions
- `get_actions.action_queue_merge` - Time spent merging actions with RTC
- `robot.get_observation` - Time to get observations from robot
- `robot.send_action` - Time to send actions to robot
- And more...

### 2. Profile Without Robot (Comparison Script)

Use `profile_rtc_comparison.py` to profile just the policy inference without needing a robot:

```bash
uv run examples/rtc/profile_rtc_comparison.py \
    --policy_path=helper2424/pi05_check_rtc \
    --device=mps \
    --num_iterations=50 \
    --execution_horizon=20
```

**Output**: Side-by-side comparison of performance with and without RTC, including:

- Mean/min/max inference times
- Throughput (iterations per second)
- Verdict on whether RTC is faster or slower

### 3. Enable Detailed Method-Level Profiling

For even more granular profiling, add the `--enable_detailed_profiling` flag:

```bash
uv run examples/rtc/profile_rtc_comparison.py \
    --policy_path=helper2424/pi05_check_rtc \
    --device=mps \
    --num_iterations=50 \
    --execution_horizon=20 \
    --enable_detailed_profiling
```

This will show timing for individual methods within the policy.

## Understanding the Output

### Key Metrics to Look At

1. **get_actions.policy_inference**
   - This should be the largest component
   - If RTC is enabled, this includes the RTC guidance overhead
   - Compare this with/without RTC to see the overhead

2. **get_actions.preprocessing**
   - Image preprocessing and normalization
   - Should be relatively fast
   - If slow, consider optimizing image processing

3. **get_actions.postprocessing**
   - Action denormalization
   - Should be minimal
   - If slow, check postprocessor implementation

4. **get_actions.action_queue_merge**
   - RTC-specific merging logic
   - Only present when RTC is enabled
   - If this is taking significant time, the RTC algorithm may need optimization

5. **robot.get_observation**
   - Robot communication overhead
   - If slow, check camera/sensor latency
   - Consider reducing image resolution

6. **robot.send_action**
   - Action execution overhead
   - Should be very fast
   - If slow, check robot communication

### Expected Performance

For a typical Pi0 policy on Apple Silicon (MPS):

- **Without RTC**: ~100-200ms per inference
- **With RTC**: Should be similar or slightly faster due to action reuse
- **Preprocessing**: ~5-20ms depending on number of cameras
- **Postprocessing**: ~1-5ms

If RTC is significantly slower, the likely causes are:

1. **RTC overhead exceeds benefits** - The guidance computation is expensive
2. **Execution horizon too small** - Not reusing enough actions to amortize overhead
3. **No compilation** - Try with `--use_torch_compile`
4. **Large prev_actions buffer** - Copying/processing previous actions is slow

## Profiling Your Own Code

### Using the Profiling Decorator

Add profiling to your own methods:

```python
from lerobot.utils.profiling import profile_method, enable_profiling, print_profiling_summary

# Enable profiling
enable_profiling()

# Decorate methods you want to profile
@profile_method
def my_slow_function(x):
    result = x  # ... replace with your code ...
    return result

# At end of execution
print_profiling_summary()
```

### Using Profile Context Manager

For profiling specific code blocks:

```python
from lerobot.utils.profiling import profile_section, enable_profiling

enable_profiling()

# load_data and model stand in for your own code
with profile_section("data_loading"):
    data = load_data()

with profile_section("model_inference"):
    output = model(data)
```

### Adding Profiling to Policy Methods

To profile specific parts of the Pi0 policy, you can add decorators:

```python
# In src/lerobot/policies/pi0/modeling_pi0.py
from lerobot.utils.profiling import profile_method, profile_section

class Pi0Policy:
    @profile_method
    def predict_action_chunk(self, obs, inference_delay=0, prev_chunk_left_over=None):
        # ... existing code ...
        pass

    def _generate_actions_with_rtc(self, *args, **kwargs):  # real signature elided
        with profile_section("rtc.guidance_computation"):
            # ... guidance code ...
            pass

        with profile_section("rtc.action_merging"):
            # ... merging code ...
            pass
```

## Analyzing Results

### Comparison Checklist

When comparing RTC vs non-RTC performance, check:

- [ ] Is `policy_inference` time higher with RTC?
- [ ] Is `action_queue_merge` taking significant time?
- [ ] Are you running enough iterations to amortize warmup?
- [ ] Is torch.compile enabled in both runs for a fair comparison?
- [ ] Is the execution horizon large enough? (should be at least 10-20)
- [ ] Are you testing on the same hardware/device?

### Common Bottlenecks

1. **Image preprocessing dominates**
   - Solution: Reduce image resolution, use fewer cameras, or optimize preprocessing

2. **Action queue operations are slow**
   - Solution: Review queue implementation, consider using ring buffer

3. **RTC guidance is expensive**
   - Solution: Reduce guidance weight, simplify guidance computation, use torch.compile

4. **Robot communication is slow**
   - Solution: Increase baud rate, reduce action frequency, optimize protocol

5. **Memory allocation overhead**
   - Solution: Pre-allocate buffers, reuse tensors, avoid unnecessary copies

## Advanced: Adding Custom Metrics

You can add custom timing metrics to the profiled script:

```python
import time

from lerobot.utils.profiling import record_timing

start = time.perf_counter()
# ... your code ...
duration = time.perf_counter() - start
record_timing("my_custom_metric", duration)
```

## Troubleshooting

### Profiling shows RTC is slower by >50%

1. Check if torch.compile is enabled: `--use_torch_compile`
2. Increase execution horizon: `--rtc.execution_horizon=30`
3. Verify inference_delay is calculated correctly
4. Profile with `--enable_detailed_profiling` to find the exact bottleneck

### Profiling output is empty

1. Make sure profiling is enabled with `enable_profiling()`
2. Verify you're running enough iterations (at least 10)
3. Check that the code is actually executing (not short-circuited)

### Inconsistent results between runs

1. Run more iterations: `--num_iterations=100`
2. Increase warmup iterations
3. Check for thermal throttling on device
4. Ensure no other processes are competing for resources

## Next Steps

1. Run both profiling scripts (with/without robot)
2. Compare timing breakdowns
3. Identify the largest bottleneck
4. Focus optimization efforts on that component
5. Re-run profiling to verify improvements

## Questions?

If profiling reveals unexpected bottlenecks or you need help interpreting results, please share:

- The full profiling output
- Your configuration (RTC enabled/disabled, execution horizon, etc.)
- Hardware specs (device type, memory, etc.)
- Policy type and size
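
When sharing results, it can help to attach timing statistics and host details in one text block. The following is a minimal sketch using only the Python standard library. The helper names (`time_callable`, `summarize_timings`, `format_report`) are hypothetical and are not part of `lerobot`; the statistics mirror the mean/min/max and throughput figures described above, with warmup iterations discarded before summarizing.

```python
import platform
import statistics
import time


def time_callable(fn, num_iterations=50):
    """Call `fn()` repeatedly and return per-call durations in seconds."""
    durations = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize_timings(samples_s, warmup=0):
    """Return mean/min/max in ms and throughput, ignoring the first `warmup` samples."""
    measured = list(samples_s)[warmup:]
    if not measured:
        raise ValueError("no samples left after discarding warmup iterations")
    mean_s = statistics.fmean(measured)
    return {
        "iterations": len(measured),
        "mean_ms": mean_s * 1000.0,
        "min_ms": min(measured) * 1000.0,
        "max_ms": max(measured) * 1000.0,
        "throughput_hz": 1.0 / mean_s if mean_s > 0 else float("inf"),
    }


def format_report(label, summary, config):
    """Build a plain-text report with host details, configuration, and timing stats."""
    lines = [
        f"== {label} ==",
        f"platform: {platform.platform()} ({platform.machine()})",
        f"python: {platform.python_version()}",
    ]
    for key, value in sorted(config.items()):
        lines.append(f"config.{key}: {value}")
    for key, value in summary.items():
        if isinstance(value, float):
            lines.append(f"{key}: {value:.3f}")
        else:
            lines.append(f"{key}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in workload; replace with a call into your own inference code.
    durations = time_callable(lambda: sum(range(10_000)), num_iterations=20)
    summary = summarize_timings(durations, warmup=5)
    print(format_report("dummy workload", summary, {"rtc_enabled": True, "execution_horizon": 20}))
```

Running the same helper once with RTC enabled and once with it disabled gives two reports that can be pasted side by side, alongside the full profiling output.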