Abstract
Humans excel at isolating relevant information from noisy data to predict the behavior of dynamic systems, effectively disregarding non-informative, temporally-correlated noise. In contrast, existing visual reinforcement learning algorithms face challenges in generating noise-free predictions within high-dimensional, noise-saturated environments, especially when trained on world models featuring realistic background noise extracted from natural video streams. We propose Task Relevant Mask Sampling (TRMS), a novel approach for identifying task-specific and reward-relevant masks. TRMS utilizes existing segmentation models as a masking prior, which is subsequently followed by a mask selector that dynamically identifies subset of masks at each timestep, selecting those most probable to contribute to task-specific rewards. To mitigate the high computational cost associated with these masking priors, a lightweight student network is trained in parallel. This network learns to perform masking independently and replaces the Segment Anything Model~(SAM)-based teacher network after a brief initial phase (<10-25% of total training). TRMS enhances the generalization capabilities of Soft Actor-Critic agents under distractions, achieves better performance on the RL-Vigen benchmark, which includes challenging variants of the DeepMind Control Suite, Dexterous Manipulation and Quadruped Locomotion tasks.
| Original language | English |
|---|---|
| Number of pages | 25 |
| Journal | Transactions on Machine Learning Research |
| Publication status | Published - 2025 |
ASJC Scopus subject areas
- Computer Vision and Pattern Recognition
- Artificial Intelligence
Fields of Expertise
- Information, Communication & Computing
Fingerprint
Dive into the research topics of 'Learning robust representations for visual reinforcement learning via task-relevant mask sampling'. Together they form a unique fingerprint.Cite this
- APA
- Standard
- Harvard
- Vancouver
- Author
- BIBTEX
- RIS