A curated collection of robotics papers focused on real-world reliability and robustness. This list began as a personal reference; I'm sharing it in the hope that it helps others.
Prerequisite for inclusion: the work must report real-world results.
Contributions are welcome!
| Name | Date | Categories | Real World Success Rate | Project | Paper | Code | Organization(s) | Notes |
|---|---|---|---|---|---|---|---|---|
| Contact-Anchored Policies: Contact Conditioning Creates Strong Robot Utility Models | 02/2026 | Representation Learning | Average zero-shot success on Picking, Opening, and Closing tasks across 4 robot arms. | Project | Paper | Code | NYU, UC Berkeley, UCLA, Hello Robot, Ai2, University of Waterloo | Replaces language conditioning with physical contact points. Uses VQ-BeT architecture with contact anchors. Trained on handheld gripper data; generalizes zero-shot to multiple robot embodiments. |
| LingBot-VA: Causal World Modeling for Robot Control | 01/2026 | World Models | Real-world Success Rate (SR) / Progress Score (PS): Make Breakfast 75% SR / 97% PS, Pick Screws 70% SR / 82.5% PS, Fold Clothes 35% SR / 48.8% PS, Unpack Delivery 65% SR / 84.5% PS, Insert Tubes 40% SR / 85.8% PS, Fold Pants 70% SR / 76.7% PS. Achieves >20% improvement over π0.5 on challenging tasks with only 50 demos. | Project | Paper | Code | Ant Group/Alibaba | Autoregressive diffusion framework (5.3B params) that predicts video and actions jointly via a Mixture-of-Transformers architecture. Plans ahead efficiently, learns quickly from data, and handles new situations well. Strong on complex, long-horizon tasks. |
| Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning | 01/2026 | World Models | 93.6% average success on challenging real-world ALOHA bimanual manipulation tasks. With model-based planning, achieves a 12.5% higher task completion rate on challenging real-world tasks. | Project | Paper | Code | NVIDIA, Stanford University | Single-stage fine-tuning of a pretrained video model (Cosmos-Predict2-2B) that generates robot actions, future states, and values as latent frames. Uses model-based planning with best-of-N sampling (see the first sketch after the table). Can learn from policy rollout data to refine the world model and value function. |
| Does learning from experience benefit small AI robotics models? | 12/2025 | Imitation Learning | 4/5 when training a simple ACT on imitation + corrections only. | Article | | | | Replicates the RL loop behind Physical Intelligence's π*0.6 foundation model without VLAs or diffusion. |
| π*0.6: a VLA That Learns From Experience | 11/2025 | VLA | Ran 13 hours straight making espresso drinks and over two hours folding novel laundry items without interruption. Success rates: Laundry (t-shirts & shorts) ~95%, Laundry (diverse hardest items) ~70%, Make Espresso ~90%, Box Assembly ~90%. | Project | Paper | | Physical Intelligence | RECAP is an iterated offline RL framework that improves a Vision-Language-Action (VLA) model. |
| RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning | 10/2025 | Online RL | 100% success across 7 tasks; 92.5% average zero-shot success on 3 tasks (without any retraining or fine-tuning); 86.7% average few-shot success on 3 tasks. | Project | Paper | | Shanghai Qi Zhi, Shanghai Jiao Tong, HKU, UNC Chapel Hill | Code to be released "after paper is accepted". |
| APO: Human-assisted Robotic Policy Refinement via Action Preference Optimization | 10/2025 | Human-in-the-loop | Improves success rates over DAgger, TPO, etc., both in-distribution and under position, background, or texture disruptions. | Project | Paper | Code | ByteDance | |
| HI-ORS: Human-in-the-loop Online Rejection Sampling for Robotic Manipulation | 10/2025 | Human-in-the-loop | Improved real-world success rates vs. vanilla BC, HIL-SERL, and Q-Chunking. | Project | Paper | Code | Tencent | |
| ARMADA/FLOAT: Autonomous Online Failure Detection and Human Shared Control Empower Scalable Real-world Deployment and Adaptation | 10/2025 | | Failure detector FLOAT achieves nearly 95% accuracy on average, surpassing prior SOTA failure-detection approaches by >20%. | Project | Paper | Code | Shanghai Jiao Tong University | |
| SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation | 09/2025 | Rewards | 83% success on folding T-shirts (flattened), 67% on folding T-shirts (crumpled); vanilla BC reaches 8% and 0%. | Project | Paper | Code in LeRobot | Stanford, UC Berkeley, xdof.ai | Video-based reward modeling framework that jointly predicts high-level task stages and fine-grained progress (see the stage-aware reward sketch after the table). |
| Dual-Actor Fine-Tuning of VLA Models: A Talk-and-Tweak Human-in-the-Loop Approach | 09/2025 | VLA | 100% success across three tasks within 101 minutes of online fine-tuning. For long-horizon tasks, sustains a 50% success rate over 12 consecutive operations. | Project | Paper | | Zhejiang & others | No code :( |
| WSRL: Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data | 07/2025 | Online RL | 100% success rate on a Franka peg-insertion task in 18 minutes, where SERL fails (0/20) even after 50 minutes. | Project | Paper | Code | UC Berkeley | Overall idea: retain no offline data during fine-tuning; instead, warm up with a small number of rollouts from the pre-trained policy (see the warmup sketch after the table). Unfortunately, only one real-world experiment; all others are in sim. |
| Dyna Robotics (Unknown Model) | 07/2025 | | 99.9% success rate folding towels for 8 hours/day over 3 days (dropped 1 towel on day 2). No intervention. | Project | | | Dyna Robotics | |
| Figure (Helix) | 06/2025 | | ~95% accuracy at correctly orienting barcodes; 4.05 seconds per package. | Project | | | Figure | Adds memory for more robust long-term tasks and force feedback for improved grip. |
| RSS 2025 Workshop: Human-in-the-Loop Robot Learning: Teaching, Correcting, and Adapting | 06/2025 | Human-in-the-loop | Various results | Project | | | Various universities | |
| Compliant Residual DAgger: Improving Real-World Contact-Rich Manipulation with Human Corrections | 06/2025 | Human-in-the-loop | Book-flipping success rate of 100% (a 60% improvement) and belt-assembly success of 70% (a 50% improvement). | Project | Paper | Code | Stanford | |
| ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations | 05/2025 | Rewards | An hour of real-world RL improves success rate from 12% to 68%, vs. 8% to 10% with VLC. | Project | Paper | Code | U Wash | |
| Dyna Robotics DYNA-1 Model | 04/2025 | | 99.4% success rate folding napkins over 24 hours. No intervention. | Project | | | Dyna Robotics | |
| ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy | 02/2025 | VLA | 96.3% average success rate across tasks, compared to 31.9% with HIL-SERL. | | Paper | Code | Chinese Academy of Sciences | Online and offline fine-tuning. |
| HIL-SERL: Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning | 10/2024 | Online RL | 100% success rate on a variety of tasks. | Project | Paper | Code | UC Berkeley | Online fine-tuning with human interventions allowed (see the intervention-gated collection sketch after the table). Implementation available in LeRobot. |
| RLIF: Interactive Imitation Learning as Reinforcement Learning | 03/2024 | Imitation Learning | 95% success rate in cloth unfolding within 7 rounds; 100% success rate in peg insertion within 6 rounds. | Project | Paper | Code | UC Berkeley | |
| SERL: A Software Suite for Sample-Efficient Robotic Reinforcement Learning | 01/2024 | Online RL | 100% success on PCB insertion, cable routing, and object relocation. | Project | Paper | Code | UC Berkeley | |
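A few techniques that recur in this list are easy to illustrate. First, the model-based planning used by Cosmos Policy amounts to best-of-N sampling against a learned value function: sample several candidate action chunks, score each, execute the best. A minimal sketch, with the policy and value heads stubbed out; none of these names are the paper's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action_chunk(obs):
    # Stub for a generative policy head; returns an 8-step chunk of
    # 7-DoF actions (shapes are illustrative, not the paper's).
    return rng.normal(size=(8, 7))

def value_of(obs, chunk):
    # Stub for a learned value head; here it simply prefers small actions.
    return -np.abs(chunk).sum()

def best_of_n(obs, n=16):
    # Sample N candidate chunks, score each with the value function,
    # and return the highest-scoring one for execution.
    chunks = [sample_action_chunk(obs) for _ in range(n)]
    scores = [value_of(obs, c) for c in chunks]
    return chunks[int(np.argmax(scores))]

obs = np.zeros(32)      # placeholder observation
chunk = best_of_n(obs)  # the chunk the robot would execute next
```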


















