RTC Profiling Guide
This guide explains how to profile RTC (Real-Time Chunking) performance to identify bottlenecks and understand why RTC might be slower than expected.
Quick Start
1. Profile with Real Robot (Profiled Version)
Use eval_with_real_robot_profiled.py to profile actual robot execution:
# With RTC enabled
uv run examples/rtc/eval_with_real_robot_profiled.py \
--policy.path=helper2424/pi05_check_rtc \
--policy.device=mps \
--rtc.enabled=true \
--rtc.execution_horizon=20 \
--robot.type=so100_follower \
--robot.port=/dev/tty.usbmodem58FA0834591 \
--robot.id=so100_follower \
--robot.cameras="{ gripper: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 30}}" \
--task="Move green small object into the purple platform" \
--duration=30
# Without RTC for comparison
uv run examples/rtc/eval_with_real_robot_profiled.py \
--policy.path=helper2424/pi05_check_rtc \
--policy.device=mps \
--rtc.enabled=false \
--robot.type=so100_follower \
--robot.port=/dev/tty.usbmodem58FA0834591 \
--robot.id=so100_follower \
--robot.cameras="{ gripper: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, front: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 30}}" \
--task="Move green small object into the purple platform" \
--duration=30
Output: At the end of execution, you'll see a detailed breakdown of timing for each component:
- get_actions.policy_inference - Time spent in policy inference
- get_actions.preprocessing - Time spent preprocessing observations
- get_actions.postprocessing - Time spent postprocessing actions
- get_actions.action_queue_merge - Time spent merging actions with RTC
- robot.get_observation - Time to get observations from robot
- robot.send_action - Time to send actions to robot
- And more...
2. Profile Without Robot (Comparison Script)
Use profile_rtc_comparison.py to profile just the policy inference without needing a robot:
uv run examples/rtc/profile_rtc_comparison.py \
--policy_path=helper2424/pi05_check_rtc \
--device=mps \
--num_iterations=50 \
--execution_horizon=20
Output: Side-by-side comparison of performance with and without RTC, including:
- Mean/min/max inference times
- Throughput (iterations per second)
- Verdict on whether RTC is faster or slower
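The summary figures listed above can be reproduced from raw per-iteration timings. The following is a minimal sketch using only the standard library. `summarize` and `verdict` are hypothetical helper names, not functions taken from profile_rtc_comparison.py, and the sample timings are invented for illustration.

```python
import statistics


def summarize(times_s):
    """Return mean/min/max latency (ms) and throughput for per-iteration times in seconds."""
    mean = statistics.mean(times_s)
    return {
        "mean_ms": mean * 1000.0,
        "min_ms": min(times_s) * 1000.0,
        "max_ms": max(times_s) * 1000.0,
        "throughput_it_per_s": 1.0 / mean,  # iterations per second
    }


def verdict(with_rtc, without_rtc):
    """Compare mean latencies and report the relative difference."""
    ratio = with_rtc["mean_ms"] / without_rtc["mean_ms"]
    if ratio > 1.0:
        return f"RTC is {100.0 * (ratio - 1.0):.1f}% slower"
    return f"RTC is {100.0 * (1.0 - ratio):.1f}% faster"


# Illustrative timings only; substitute the values measured on your machine.
rtc_stats = summarize([0.150, 0.160, 0.155])
base_stats = summarize([0.120, 0.125, 0.130])
print(rtc_stats)
print(verdict(rtc_stats, base_stats))
```

Comparing means is the simplest choice; if a few iterations are much slower than the rest, the median may be a more robust basis for the verdict.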
3. Enable Detailed Method-Level Profiling
For even more granular profiling, add the --enable_detailed_profiling flag:
uv run examples/rtc/profile_rtc_comparison.py \
--policy_path=helper2424/pi05_check_rtc \
--device=mps \
--num_iterations=50 \
--execution_horizon=20 \
--enable_detailed_profiling
This will show timing for individual methods within the policy.
Understanding the Output
Key Metrics to Look At
- get_actions.policy_inference - This should be the largest component
  - If RTC is enabled, this includes the RTC guidance overhead
  - Compare this with/without RTC to see the overhead
- get_actions.preprocessing - Image preprocessing and normalization
  - Should be relatively fast
  - If slow, consider optimizing image processing
- get_actions.postprocessing - Action denormalization
  - Should be minimal
  - If slow, check postprocessor implementation
- get_actions.action_queue_merge - RTC-specific merging logic
  - Only present when RTC is enabled
  - If this is taking significant time, the RTC algorithm may need optimization
- robot.get_observation - Robot communication overhead
  - If slow, check camera/sensor latency
  - Consider reducing image resolution
- robot.send_action - Action execution overhead
  - Should be very fast
  - If slow, check robot communication
Expected Performance
For a typical Pi0 policy on Apple Silicon (MPS):
- Without RTC: ~100-200ms per inference
- With RTC: Should be similar or slightly faster due to action reuse
- Preprocessing: ~5-20ms depending on number of cameras
- Postprocessing: ~1-5ms
If RTC is significantly slower, likely causes:
- RTC overhead exceeds benefits - The guidance computation is expensive
- Execution horizon too small - Not reusing enough actions to amortize overhead
- No compilation - Try with --use_torch_compile
- Large prev_actions buffer - Copying/processing previous actions is slow
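One rough way to reason about the execution-horizon point above is to divide the cost of a single inference call by the number of actions executed before the next call. This is back-of-the-envelope arithmetic, not a model of the RTC implementation. The function name and the 150 ms figure are illustrative assumptions.

```python
def amortized_ms_per_action(inference_ms, execution_horizon):
    """Inference cost spread over the actions executed before the next inference call."""
    if execution_horizon <= 0:
        raise ValueError("execution_horizon must be positive")
    return inference_ms / execution_horizon


# Illustrative numbers: one 150 ms inference call at several horizons.
for horizon in (5, 10, 20, 30):
    print(f"horizon={horizon:2d} -> {amortized_ms_per_action(150.0, horizon):.1f} ms per action")
```

Under this simplification, a fixed per-call overhead weighs less on each action as the horizon grows, which is why a very small horizon can make RTC look expensive.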
Profiling Your Own Code
Using the Profiling Decorator
Add profiling to your own methods:
from lerobot.utils.profiling import profile_method, enable_profiling, print_profiling_summary

# Enable profiling
enable_profiling()

# Decorate methods you want to profile
@profile_method
def my_slow_function(x):
    # ... your code ...
    result = x  # placeholder for the value your code computes
    return result

# At end of execution
print_profiling_summary()
Using Profile Context Manager
For profiling specific code blocks:
from lerobot.utils.profiling import profile_section, enable_profiling

enable_profiling()

with profile_section("data_loading"):
    data = load_data()

with profile_section("model_inference"):
    output = model(data)
Adding Profiling to Policy Methods
To profile specific parts of the Pi0 policy, you can add decorators:
# In src/lerobot/policies/pi0/modeling_pi0.py
from lerobot.utils.profiling import profile_method, profile_section

class Pi0Policy:
    @profile_method
    def predict_action_chunk(self, obs, inference_delay=0, prev_chunk_left_over=None):
        # ... existing code ...
        pass

    def _generate_actions_with_rtc(self, ...):
        with profile_section("rtc.guidance_computation"):
            # ... guidance code ...
            pass

        with profile_section("rtc.action_merging"):
            # ... merging code ...
            pass
Analyzing Results
Comparison Checklist
When comparing RTC vs non-RTC performance, check:
- Is policy_inference time higher with RTC?
- Is action_queue_merge taking significant time?
- Are you running enough iterations to amortize warmup?
- Is torch.compile enabled for fair comparison?
- Is the execution horizon large enough? (it should be at least 10-20)
- Are you testing on the same hardware/device?
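The warmup item in the checklist can be made concrete with a small timing loop that discards the first few iterations. This is a generic sketch, not the loop used by profile_rtc_comparison.py. `benchmark` and `fake_inference` are hypothetical names. On an accelerator such as MPS or CUDA, work is dispatched asynchronously, so the device would also need to be synchronized before each clock reading.

```python
import time


def benchmark(fn, num_iterations=50, num_warmup=5):
    """Time `fn` for `num_iterations` calls after `num_warmup` untimed calls."""
    for _ in range(num_warmup):
        fn()  # warmup: results discarded (caches, lazy initialization, compilation)
    times = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return times


def fake_inference():
    """Stand-in workload; replace with a call into your policy."""
    sum(i * i for i in range(10_000))


times = benchmark(fake_inference, num_iterations=20, num_warmup=3)
print(f"{len(times)} timed iterations, mean {sum(times) / len(times) * 1000:.3f} ms")
```

Use the same `num_warmup` and `num_iterations` for the RTC and non-RTC runs so the two sets of timings are comparable.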
Common Bottlenecks
- Image preprocessing dominates
  - Solution: Reduce image resolution, use fewer cameras, or optimize preprocessing
- Action queue operations are slow
  - Solution: Review queue implementation, consider using ring buffer
- RTC guidance is expensive
  - Solution: Reduce guidance weight, simplify guidance computation, use torch.compile
- Robot communication is slow
  - Solution: Increase baud rate, reduce action frequency, optimize protocol
- Memory allocation overhead
  - Solution: Pre-allocate buffers, reuse tensors, avoid unnecessary copies
Advanced: Adding Custom Metrics
You can add custom timing metrics to the profiled script:
import time

from lerobot.utils.profiling import record_timing

start = time.perf_counter()
# ... your code ...
duration = time.perf_counter() - start
record_timing("my_custom_metric", duration)
Troubleshooting
Profiling shows RTC is slower by >50%
- Check if torch.compile is enabled: --use_torch_compile
- Increase execution horizon: --rtc.execution_horizon=30
- Verify inference_delay is calculated correctly
- Profile with --enable_detailed_profiling to find the exact bottleneck
Profiling output is empty
- Make sure profiling is enabled with enable_profiling()
- Verify you're running enough iterations (at least 10)
- Check that code is actually executing (not short-circuited)
Inconsistent results between runs
- Run more iterations: --num_iterations=100
- Increase warmup iterations
- Check for thermal throttling on device
- Ensure no other processes are competing for resources
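To judge whether run-to-run differences are just noise, one option is to look at the spread of the timings relative to their mean. The sketch below computes the coefficient of variation with the standard library. `coefficient_of_variation` is a hypothetical helper, the timings are invented, and any cutoff for "too noisy" is a judgment call rather than something this guide prescribes.

```python
import statistics


def coefficient_of_variation(times_s):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(times_s) / statistics.mean(times_s)


# Illustrative timings in seconds; substitute your own measurements.
run = [0.100, 0.110, 0.090, 0.105, 0.095]
cv = coefficient_of_variation(run)
print(f"median={statistics.median(run) * 1000:.1f} ms, cv={cv:.1%}")
```

If the spread within one configuration is comparable to the difference between the RTC and non-RTC means, more iterations or a quieter machine are probably needed before drawing conclusions.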
Next Steps
- Run both profiling scripts (with/without robot)
- Compare timing breakdowns
- Identify the largest bottleneck
- Focus optimization efforts on that component
- Re-run profiling to verify improvements
Questions?
If profiling reveals unexpected bottlenecks or you need help interpreting results, please share:
- The full profiling output
- Your configuration (RTC enabled/disabled, execution horizon, etc.)
- Hardware specs (device type, memory, etc.)
- Policy type and size