[晓理紫] Daily Paper Digest (with Chinese abstracts and source code or project links): Reinforcement Learning

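Topic-specific paper subscriptions

Follow {晓理紫|小李子} for daily paper updates. If you find this interesting, please forward it to anyone who might need it. Thank you for your support.

If you find this helpful, please follow me; the latest papers are pushed to you on time every day.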

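Categories:

Large language models (LLM)
Vision models (VLM)
Diffusion models
Vision-and-language navigation (VLN)
Reinforcement learning (RL)
Imitation learning (IL)
Robotics
Open vocabulary, detection and segmentation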

== RLHF ==

标题: Classification with Costly Features in Hierarchical Deep Sets

作者: Jaromír Janisch, Tomáš Pevný, Viliam Lisý

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/1911.08756v6

GitHub: https://github.com/jaromiru/rcwcf

中文摘要: 具有昂贵特征的分类（CwCF）是一个分类问题，它在优化标准中包括特征的成本。对于每个样本，依次获取其特征，以最大限度地提高准确性，同时最小化获取特征的成本。然而，现有的方法只能处理可以表示为固定长度向量的数据。在现实生活中，数据往往具有丰富而复杂的结构，用XML或JSON等格式可以更精确地描述。数据是分层的，通常包含嵌套的对象列表。在这项工作中，我们用分层深度集和分层softmax扩展了现有的基于深度强化学习的算法，以便它可以直接处理这些数据。扩展方法对它可以获取的特征有更大的控制，并且在对七个数据集的实验中，我们表明这导致了更好的性能。为了展示新方法的实际用途，我们使用在线服务将其应用于对恶意web域进行分类的现实问题。

摘要: Classification with Costly Features (CwCF) is a classification problem that includes the cost of features in the optimization criteria. Individually for each sample, its features are sequentially acquired to maximize accuracy while minimizing the acquired features’ cost. However, existing approaches can only process data that can be expressed as vectors of fixed length. In real life, the data often possesses rich and complex structure, which can be more precisely described with formats such as XML or JSON. The data is hierarchical and often contains nested lists of objects. In this work, we extend an existing deep reinforcement learning-based algorithm with hierarchical deep sets and hierarchical softmax, so that it can directly process this data. The extended method has greater control over which features it can acquire and, in experiments with seven datasets, we show that this leads to superior performance. To showcase the real usage of the new method, we apply it to a real-life problem of classifying malicious web domains, using an online service.

标题: Leveraging RL for Efficient Collection of Perception Messages in Vehicular Networks

作者: Chaima Zoghlami, Rahim Kacimi, Riadh Dhaou

PubTime: 2024-02

Downlink: https://ieeexplore.ieee.org/document/10449924/

Journal: 2024 Global Information Infrastructure and Networking Symposium (GIIS)

中文摘要: 协作消息通过增强态势感知、支持防撞和提高交通效率，在车辆对一切（V2X）应用中发挥着至关重要的作用。此外，它们通过增强环境感知，有助于弱势道路使用者（VRU）的安全。本文的目的是介绍一种新的Q学习技术，它可以改进协作消息的类型、大小和频率的选择。该方法基于利用车载网络中现有消息的多样性来确定具有适当大小的最佳消息类型，同时根据环境上下文调整其传输频率，以有效地管理网络资源。除了减轻网络过载和减少同时发送的消息数量之外，当vru被联网和/或自动驾驶车辆（CAV）识别时，我们的方法应用于vru时，还可以显著节省能源。

摘要: Cooperative messages play a vital role in vehicle-to-everything (V2X) applications by enhancing situational awareness, supporting collision avoidance and improving traffic efficiency. Additionally, they contribute to Vulnerable Road Users (VRU) safety by increasing environment perception. The purpose of this paper is to introduce a novel Q-Learning technique that can improve the selection of cooperative messages’ type, size, and frequency. The methodology is based on leveraging the diversity of existing messages in vehicular networks to determine the best message type with the appropriate size while adjusting its transmission frequency according to the environmental context to efficiently manage network resources. In addition to alleviating the network overload and decreasing the number of messages sent simultaneously, our method could result in significant energy savings when applied to VRUs when they are identified by Connected and/or Autonomous Vehicles (CAV).
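
For readers unfamiliar with Q-learning, the update rule behind the technique named in this abstract can be sketched in tabular form. This is a generic illustration only: the state labels, the action set of message types, the reward value, and the hyperparameters are hypothetical placeholders and are not taken from the paper.

```python
import random
from collections import defaultdict

# Hypothetical action set: message choices available to a node.
ACTIONS = ["short_msg", "full_msg", "skip"]

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

def choose_action(q, state, epsilon=0.1, rng=random):
    """Epsilon-greedy selection: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])

q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0
new_value = q_update(q_table, "dense_traffic", "short_msg",
                     reward=1.0, next_state="sparse_traffic")
greedy = choose_action(q_table, "dense_traffic", epsilon=0.0)
```

In a real system the reward would have to encode the trade-off the abstract describes (perception quality versus network load and energy); how the authors define it is not stated here.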

标题: Concurrent Learning of Policy and Unknown Safety Constraints in Reinforcement Learning

作者: Lunet Yifru, Ali Baheri

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2402.15893v2

中文摘要: 在过去的几十年里，强化学习（RL）已经彻底改变了广泛领域的决策。然而，在现实世界场景中部署RL策略带来了确保安全性的关键挑战。传统的safe RL方法主要集中在将预定义的安全约束整合到策略学习过程中。然而，这种对预定义安全约束的依赖在动态和不可预测的真实世界设置中造成了限制，在这些设置中，这些约束可能不可用或适应性不足。为了弥补这一差距，我们提出了一种新的方法，同时学习安全的RL控制策略，并识别给定环境的未知安全约束参数。使用参数信号时态逻辑（pSTL）安全规范和小的初始标记数据集进行初始化，我们将该问题框架为双层优化任务，使用双延迟深度确定性策略梯度（TD3）算法的拉格朗日变体，与贝叶斯优化错综复杂地集成约束策略优化，以优化给定pSTL安全规范的参数。通过综合案例研究的实验，我们验证了这种方法在不同形式的环境约束下的有效性，始终产生具有高回报的安全RL政策。此外，我们的发现表明STL安全约束参数的成功学习，表现出与真实环境安全约束的高度一致性。我们的模型的性能密切反映了拥有安全约束的完整先验知识的理想场景的性能，证明了它在准确识别环境安全约束和学习遵守这些约束的安全策略方面的熟练程度。

摘要: Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet, deploying RL policies in real-world scenarios presents the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. However, this reliance on predefined safety constraints poses limitations in dynamic and unpredictable real-world settings where such constraints may not be available or sufficiently adaptable. Bridging this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Initializing with a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task, intricately integrating constrained policy optimization, using a Lagrangian-variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization for optimizing parameters for the given pSTL safety specification. Through experimentation in comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently yielding safe RL policies with high returns. Furthermore, our findings indicate successful learning of STL safety constraint parameters, exhibiting a high degree of conformity with true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario that possesses complete prior knowledge of safety constraints, demonstrating its proficiency in accurately identifying environmental safety constraints and learning safe policies that adhere to those constraints.
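
The Lagrangian handling of a safety constraint mentioned in this abstract can be illustrated with a generic dual-ascent step. This is a textbook-style sketch under stated assumptions, not the paper's TD3 variant: the function names, the cost limit, and the learning rate are hypothetical, and the Bayesian optimization over pSTL parameters is not shown.

```python
def update_multiplier(lmbda, episode_cost, cost_limit, lr=0.05):
    """Dual ascent: raise lambda when the cost exceeds the limit, lower it
    otherwise, and keep it non-negative."""
    return max(0.0, lmbda + lr * (episode_cost - cost_limit))

def penalized_return(episode_return, episode_cost, lmbda):
    """Objective seen by the policy: return minus the weighted constraint cost."""
    return episode_return - lmbda * episode_cost

# Made-up numbers: the episode violated the (hypothetical) cost limit of 1.0.
lmbda = update_multiplier(0.0, episode_cost=3.0, cost_limit=1.0, lr=0.5)
objective = penalized_return(10.0, 3.0, lmbda)
```

The policy is trained to maximize the penalized return while the multiplier is adjusted in the opposite direction, so persistent violations make unsafe behavior progressively more expensive.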

标题: Emergent Dominance Hierarchies in Reinforcement Learning Agents

作者: Ram Rachum, Yonatan Nakar, Bill Tomlinson

PubTime: 2024-02-27

Downlink: http://arxiv.org/abs/2401.12258v4

中文摘要: 现代强化学习（RL）算法能够在各种各样的任务中胜过人类。多智能体强化学习（MARL）设置提出了额外的挑战，混合动机智能体组中的成功合作取决于个人和群体目标之间的微妙平衡。通常受人类制度启发的社会习俗和规范被用作实现这种平衡的工具。在本文中，我们考察了一个基本的、经过充分研究的社会惯例，它是动物和人类社会合作的基础：统治等级。我们将优势等级的行为学理论应用于人工代理，借用已建立的术语和定义，尽可能少地修改。我们证明了RL代理人的群体，在没有显式编程或内在奖励的情况下运作，可以发明、学习、执行并向新群体传递优势等级。出现的优势等级与在鸡、老鼠、鱼和其他物种中研究的结构相似。

摘要: Modern Reinforcement Learning (RL) algorithms are able to outperform humans in a wide variety of tasks. Multi-agent reinforcement learning (MARL) settings present additional challenges, and successful cooperation in mixed-motive groups of agents depends on a delicate balancing act between individual and group objectives. Social conventions and norms, often inspired by human institutions, are used as tools for striking this balance. In this paper, we examine a fundamental, well-studied social convention that underlies cooperation in both animal and human societies: dominance hierarchies. We adapt the ethological theory of dominance hierarchies to artificial agents, borrowing the established terminology and definitions with as few amendments as possible. We demonstrate that populations of RL agents, operating without explicit programming or intrinsic rewards, can invent, learn, enforce, and transmit a dominance hierarchy to new populations. The dominance hierarchies that emerge have a similar structure to those studied in chickens, mice, fish, and other species.

标题: Reinforcement learning-assisted quantum architecture search for variational quantum algorithms

作者: Akash Kundu

PubTime: 2024-02-27

Downlink: http://arxiv.org/abs/2402.13754v2

中文摘要: 含噪声中等规模量子（NISQ）时代的一个重大障碍是识别功能量子电路。这些电路还必须遵守当前量子硬件限制所施加的约束。变分量子算法（VQAs）是一类量子经典优化算法，旨在解决当前可用量子器件中的这些挑战。然而，VQAs的整体性能取决于变分电路的初始化策略、电路的结构（也称为ansatz）和成本函数的配置。围绕电路的结构，在这篇论文中，我们通过使用强化学习（RL）自动搜索变分电路的最佳结构来提高VQAs的性能。在本文中，电路的最优性是通过评估它的深度、门和参数的总数以及它在解决给定问题时的准确性来确定的。自动搜索最佳量子电路的任务被称为量子架构搜索（QAS）。QAS的大部分研究主要集中在无噪声场景上。然而，噪声对QAS的影响仍未得到充分探讨。在这篇论文中，我们通过引入基于张量的量子电路编码、对环境动力学的限制来有效地探索可能电路的搜索空间、引导代理寻找更短电路的回合（episode）终止方案、具有 $\epsilon$-贪婪策略的双深度Q-网络（DDQN）来解决这个问题，以获得更好的稳定性。在无噪声和有噪声的量子硬件上的数值实验表明，在处理各种VQA时，我们基于RL的QAS优于现有的QAS。同时，我们在论文中提出的方法可以很容易地适用于解决广泛的其他VQA。

摘要: A significant hurdle in the noisy intermediate-scale quantum (NISQ) era is identifying functional quantum circuits. These circuits must also adhere to the constraints imposed by current quantum hardware limitations. Variational quantum algorithms (VQAs), a class of quantum-classical optimization algorithms, were developed to address these challenges in the currently available quantum devices. However, the overall performance of VQAs depends on the initialization strategy of the variational circuit, the structure of the circuit (also known as ansatz), and the configuration of the cost function. Focusing on the structure of the circuit, in this thesis, we improve the performance of VQAs by automating the search for an optimal structure for the variational circuits using reinforcement learning (RL). Within the thesis, the optimality of a circuit is determined by evaluating its depth, the overall count of gates and parameters, and its accuracy in solving the given problem. The task of automating the search for optimal quantum circuits is known as quantum architecture search (QAS). The majority of research in QAS is primarily focused on a noiseless scenario. Yet, the impact of noise on the QAS remains inadequately explored. In this thesis, we tackle the issue by introducing a tensor-based quantum circuit encoding, restrictions on environment dynamics to explore the search space of possible circuits efficiently, an episode halting scheme to steer the agent to find shorter circuits, a double deep Q-network (DDQN) with an $\epsilon$ -greedy policy for better stability. The numerical experiments on noiseless and noisy quantum hardware show that in dealing with various VQAs, our RL-based QAS outperforms existing QAS. Meanwhile, the methods we propose in the thesis can be readily adapted to address a wide range of other VQAs.
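
The double deep Q-network (DDQN) named in this abstract differs from a plain DQN in how the bootstrap target is formed. A minimal sketch of that target computation follows, using plain lists in place of network outputs; the Q-values and the discount factor are made-up numbers, and nothing here reflects the thesis's circuit encoding or environment.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: the online network selects the next action,
    the target network evaluates it."""
    if done:
        return reward  # no bootstrapping past a terminal step
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]

# Toy next-state Q-value estimates for three actions (made-up numbers).
q_online_next = [0.2, 1.5, 0.7]
q_target_next = [0.9, 1.0, 2.0]
target = double_dqn_target(1.0, False, q_online_next, q_target_next, gamma=0.5)
```

A plain DQN would bootstrap from `max(q_target_next)` instead; decoupling selection from evaluation is the standard remedy for overestimated Q-values, which is one common reading of the "better stability" the abstract refers to.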

标题: Simple, unified analysis of Johnson-Lindenstrauss with applications

作者: Yingru Li

PubTime: 2024-02-27

Downlink: http://arxiv.org/abs/2402.10232v3

摘要: We present a simple and unified analysis of the Johnson-Lindenstrauss (JL) lemma, a cornerstone in the field of dimensionality reduction critical for managing high-dimensional data. Our approach not only simplifies the understanding but also unifies various constructions under the JL framework, including spherical, binary-coin, sparse JL, Gaussian and sub-Gaussian models. This simplification and unification make significant strides in preserving the intrinsic geometry of data, essential across diverse applications from streaming algorithms to reinforcement learning. Notably, we deliver the first rigorous proof of the spherical construction’s effectiveness and provide a general class of sub-Gaussian constructions within this simplified framework. At the heart of our contribution is an innovative extension of the Hanson-Wright inequality to high dimensions, complete with explicit constants. By employing simple yet powerful probabilistic tools and analytical techniques, such as an enhanced diagonalization process, our analysis not only solidifies the JL lemma’s theoretical foundation by removing an independence assumption but also extends its practical reach, showcasing its adaptability and importance in contemporary computational algorithms.
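
The distance-preserving property that the JL lemma guarantees can be observed numerically. The sketch below uses the Gaussian construction only (one of the several constructions the abstract lists); the dimensions, the seed, and the helper names are arbitrary choices for illustration, and the code uses only the Python standard library.

```python
import math
import random

def gaussian_projection(dim_in, dim_out, rng):
    """Random matrix with i.i.d. N(0, 1/dim_out) entries (Gaussian JL construction)."""
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]

def project(matrix, vector):
    """Matrix-vector product: maps a dim_in vector to dim_out coordinates."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(0)
dim_in, dim_out = 500, 200
x = [rng.uniform(-1, 1) for _ in range(dim_in)]
y = [rng.uniform(-1, 1) for _ in range(dim_in)]
proj = gaussian_projection(dim_in, dim_out, rng)

# Ratio of projected distance to original distance; JL says it concentrates near 1.
ratio = distance(project(proj, x), project(proj, y)) / distance(x, y)
print(f"distance ratio after projection: {ratio:.3f}")
```

Increasing `dim_out` tightens the concentration of the ratio around 1, at the cost of a larger projected representation.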

== Imitation Learning ==

标题: Robot at the Mirror: Learning to Imitate via Associating Self-supervised Models

作者: Andrej Lucny, Kristina Malinovska, Igor Farkas

PubTime: 2024-02-26

Downlink: http://arxiv.org/abs/2311.13226v2

Project: https://link.springer.com/chapter/10.1007/978-3-031-44207-0_39

GitHub: https://github.com/andylucny/learningImitation/tree/main/mirror

中文摘要: 我们介绍了一种从现成的自我监督模型通过关联而不是训练和微调来构建定制模型的方法。我们用一个人形机器人看着镜子并学习从它感知的图像中检测自己身体的3D姿势的例子来演示它。为了建立我们的模型，我们首先通过机器人操作前准备的模型从视觉输入和机器人身体的姿势中获得特征。然后，我们通过一个样本高效的机器人在镜子上的自我探索来映射它们相应的潜在空间。通过这种方式，机器人构建请求的3D姿态检测器，该检测器在采集的样本上立即获得完美的质量，而不是逐渐获得质量。该映射采用特征向量对的关联，然后以与著名的Transformer模型的键值机制相同的方式实现。最后，将我们的模型部署到模拟机器人上，使我们能够在没有人类参与的情况下研究、调整和系统地评估其超参数，推进我们之前的研究。

摘要: We introduce an approach to building a custom model from ready-made self-supervised models by associating them instead of training and fine-tuning. We demonstrate it with an example of a humanoid robot looking at the mirror and learning to detect the 3D pose of its own body from the image it perceives. To build our model, we first obtain features from the visual input and the postures of the robot’s body via models prepared before the robot’s operation. Then, we map their corresponding latent spaces by a sample-efficient robot’s self-exploration at the mirror. In this way, the robot builds the solicited 3D pose detector, whose quality is immediately perfect on the acquired samples instead of obtaining the quality gradually. The mapping, which employs associating the pairs of feature vectors, is then implemented in the same way as the key-value mechanism of the famous transformer models. Finally, deploying our model for imitation to a simulated robot allows us to study, tune up, and systematically evaluate its hyperparameters without the involvement of the human counterpart, advancing our previous research.
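
The key-value mechanism this abstract alludes to can be sketched as a soft lookup: stored pairs of feature vectors act as keys and values, and a query retrieves a similarity-weighted mix of the values. The code below is a generic attention-style illustration, not the authors' implementation; the function name, the `temperature` parameter, and the toy vectors are hypothetical.

```python
import math

def attention_lookup(query, keys, values, temperature=1.0):
    """Return the softmax-weighted average of `values`, weighted by the
    dot-product similarity between `query` and each key."""
    scores = [sum(q * k for q, k in zip(query, key)) / temperature for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical stored associations: image features (keys) -> body postures (values).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, -10.0], [20.0, -20.0]]
posture = attention_lookup([5.0, 0.0], keys, values)
```

Because adding a new association only means appending one key-value pair, a lookup of this kind reproduces stored samples closely without gradient training, which matches the abstract's remark that quality on acquired samples is immediate.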

标题: Visualization of Game Strategies from Imitation Learning Agents

作者: Ueno Masayuki, Takami Tomoyuki

PubTime: 2023-10

Downlink: https://ieeexplore.ieee.org/document/10315659/

Journal: 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE)

中文摘要: 通过机器学习人类学习者在特定情况下的行为，可以将代理用作行为类似人类的代理。这种模仿学习代理可以用于各种教育目的。通过使用XAI技术来可视化模仿学习代理如何评估棋盘，它可以用来学习自己和他人的策略。

摘要: By applying machine learning to how a human learner behaves in a given situation, an agent can be made to behave like a human. Such an imitation learning agent can be used for various educational purposes. By using XAI technology to visualize how the imitation learning agent evaluates the board, learners can study their own strategies and those of others.

标题: Scaling Data Generation in Vision-and-Language Navigation

作者: Zun Wang, Jialu Li, Yicong Hong

PubTime: 2023-10

Downlink: https://ieeexplore.ieee.org/document/10378395/

Journal: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

中文摘要: 最近在语言引导的视觉导航方面的研究表明，对可穿越环境的多样性和训练可概括代理的监督量有很大的需求。为了解决现有视觉和语言导航数据集中常见的数据稀缺问题，我们提出了一种生成大规模学习数据的有效范式，该范式应用了来自HM3D和Gibson数据集的1200多个照片级逼真环境，并使用网络上完全可访问的资源合成了490万个指令轨迹对。重要的是，我们调查了该范式中每个组件对代理性能的影响，并研究了如何充分应用增强的数据来预训练和微调代理。由于我们的大规模数据集，仅通过简单的模仿学习，现有代理的性能就可以被提升（相对于以前的SoTA，绝对+11%）到R2R测试集划分上80%的单次运行成功率这一显著的新最佳水平。在可见和不可见环境中导航之间的长期泛化差距也减少到不到1%（与之前的最佳方法中的8%相比）。此外，我们的范式还有助于不同的模型在CVDN、REVERIE和连续环境中的R2R上实现新的最先进的导航结果。

摘要: Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale data for learning, which applies 1200+ photo-realistic environments from HM3D and Gibson datasets and synthesizes 4.9 million instruction-trajectory pairs using fully-accessible resources on the web. Importantly, we investigate the influence of each component in this paradigm on the agent’s performance and study how to adequately apply the augmented data to pre-train and fine-tune an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to previous SoTA) to a significantly new best of 80% single-run success rate on the R2R test split by simple imitation learning. The long-lasting generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% in the previous best method). Moreover, our paradigm also facilitates different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.

标题: Cosine Similarity Based Representation Learning for Adversarial Imitation Learning

作者: Xiongzhen Zhang, Quan Liu, Lihua Zhang

PubTime: 2023-10

Downlink: https://ieeexplore.ieee.org/document/10394257/

Journal: 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

中文摘要: 对抗性模仿学习（AIL）旨在从专家演示中恢复奖励信号，并通过采用奖励和强化学习来学习专家策略。然而，演示的原始状态——动作特征通常具有特定控制任务的冗余信息，因此从原始特征中学习的奖励通常是有偏差的，这最终导致AIL中的低样本效率和不稳定性。为了解决这些问题，我们提出了CSAIL：基于余弦相似性的对抗性模仿学习。CSAIL通过一种新的基于余弦相似性的损失从演示中提取专家策略表示，并通过学习的表示恢复鲁棒和无偏的奖励函数。基于奖励，CSAIL通过Wasserstein距离优化方法模拟专家策略。实验结果表明，在具有挑战性的Mujoco机器人控制和自动驾驶任务方面，CSAIL优于现有的最先进的AIL方法。

摘要: Adversarial imitation learning (AIL) aims to recover the reward signal from expert demonstrations and learn expert policy by employing reward and reinforcement learning. However, the raw state-action features of the demonstrations usually have redundant information for a particular control task, and therefore the reward learned from the raw features is often biased, which eventually results in low sample efficiency and instability in AIL. To address these issues, we present CSAIL: Cosine Similarity based Adversarial Imitation Learning. CSAIL extracts expert policy representations from demonstrations via a novel cosine similarity based loss and recovers a robust and unbiased reward function by the learned representations. Based on the reward, CSAIL mimics the expert policy by the Wasserstein distance optimization method. Experimental results show that CSAIL outperforms existing state-of-the-art AIL methods on challenging Mujoco robot control and autonomous driving tasks.
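
The similarity measure that CSAIL's loss is built on can be written in a few lines. Only cosine similarity itself is shown; the abstract does not spell out the loss, so no attempt is made to reproduce it, and the example vectors are arbitrary.

```python
import math

def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between u and v; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])     # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])    # opposite directions
```

Because the measure ignores vector magnitude, a loss based on it compares the direction of representations rather than their scale, which is one plausible reason for choosing it when raw state-action features carry redundant information.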

标题: Curriculum-Based Imitation of Versatile Skills

作者: Maximilian Xiling Li, Onur Celik, Philipp Becker

PubTime: 2023-06

Downlink: https://ieeexplore.ieee.org/document/10160543/

Journal: 2023 IEEE International Conference on Robotics and Automation (ICRA)

GitHub: https://github.com/intuitive-robots/ML-Cur

中文摘要: 通过模仿学习技能对于机器人的直观教学来说是一个很有前途的概念。学习这种技能的一种常见方法是通过最大化给定演示的可能性来学习参数模型。然而，人类演示通常是多模态的，即同一任务以多种方式解决，这对于基于这种最大似然（ML）目标的大多数模仿学习方法来说是一个主要挑战。ML目标迫使模型覆盖所有数据，它防止上下文空间中的专门化，并可能导致行为空间中的模式平均，从而导致次优或潜在的灾难性行为。在这里，我们通过引入一个对每个数据点使用权重的课程来缓解这些问题，允许模型专注于它可以表示的数据，同时通过熵奖励激励它覆盖尽可能多的数据。我们将我们的算法扩展到（线性）专家（MoE）的混合，使得单个组件可以专注于局部上下文区域，而MoE覆盖所有数据点。我们在复杂的模拟和真实机器人控制任务中评估了我们的方法，并表明它从多种人类演示中学习，并显著优于当前的SOTA方法。¹ 参考实现可在 https://github.com/intuitive-robots/ML-Cur 获取。

摘要: Learning skills by imitation is a promising concept for the intuitive teaching of robots. A common way to learn such skills is to learn a parametric model by maximizing the likelihood given the demonstrations. Yet, human demonstrations are often multi-modal, i.e., the same task is solved in multiple ways which is a major challenge for most imitation learning methods that are based on such a maximum likelihood (ML) objective. The ML objective forces the model to cover all data, it prevents specialization in the context space and can cause mode-averaging in the behavior space, leading to suboptimal or potentially catastrophic behavior. Here, we alleviate those issues by introducing a curriculum using a weight for each data point, allowing the model to specialize on data it can represent while incentivizing it to cover as much data as possible by an entropy bonus. We extend our algorithm to a Mixture of (linear) Experts (MoE) such that the single components can specialize on local context regions, while the MoE covers all data points. We evaluate our approach in complex simulated and real robot control tasks and show it learns from versatile human demonstrations and significantly outperforms current SOTA methods.
¹ A reference implementation can be found at https://github.com/intuitive-robots/ML-Cur

标题: A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning

作者: Aishwarya Kamath, Peter Anderson, Su Wang

PubTime: 2023-06

Downlink: https://ieeexplore.ieee.org/document/10205492/

Journal: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

摘要: Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 ° panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky [63], a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN [27]. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale training on near-human quality synthetic instructions.

== Human-Robot Interaction ==

标题: ROS-Causal: A ROS-based Causal Analysis Framework for Human-Robot Interaction Applications

作者: Luca Castri, Gloria Beraldo, Sariah Mghames

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2402.16068v2

GitHub: https://github.com/lcastri/roscausal.git

中文摘要: 在人类共享空间部署机器人需要了解附近代理和对象之间的交互。通过因果推理模拟因果关系有助于预测人类行为和机器人干预。然而，一个关键的挑战出现了，因为现有的因果发现方法目前缺乏在机器人学事实上的标准ROS生态系统内的实施，阻碍了机器人学的有效利用。为了解决这一差距，本文引入了ROS-Causal，这是一个基于ROS的框架，用于人机空间交互中的机载数据收集和因果发现。一个与ROS集成的特设模拟器说明了该方法的有效性，展示了机器人在数据收集期间板载因果模型的生成。ROS-Causal可以在GitHub上找到：https://github.com/lcastri/roscausal.git。

摘要: Deploying robots in human-shared spaces requires understanding interactions among nearby agents and objects. Modelling cause-and-effect relations through causal inference aids in predicting human behaviours and anticipating robot interventions. However, a critical challenge arises as existing causal discovery methods currently lack an implementation inside the ROS ecosystem, the de facto standard in robotics, hindering effective utilisation in robotics. To address this gap, this paper introduces ROS-Causal, a ROS-based framework for onboard data collection and causal discovery in human-robot spatial interactions. An ad-hoc simulator, integrated with ROS, illustrates the approach’s effectiveness, showcasing the robot onboard generation of causal models during data collection. ROS-Causal is available on GitHub: https://github.com/lcastri/roscausal.git.

== Object Detection, Segmentation, Open-Vocabulary Detection, SAM ==

标题: OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics

作者: Peiqi Liu, Yaswanth Orru, Jay Vakil

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2401.12202v2

Project: https://ok-robot.github.io

GitHub: https://github.com/ok-robot/ok-robot

中文摘要: 近年来在视觉、语言和机器人领域取得了显著进展。我们现在有了能够基于语言查询识别物体的视觉模型，可以有效控制移动系统的导航系统，以及可以处理各种物体的抓取模型。尽管取得了这些进步，机器人的通用应用仍然落后，尽管它们依赖于识别、导航和抓取这些基本能力。在本文中，我们采用系统优先的方法来开发一个新的开放的基于知识的机器人框架，称为OK-Robot。通过结合用于对象检测的视觉语言模型（VLMs）、用于运动的导航图元和用于对象操作的抓取图元，OK-Robot提供了一个无需任何训练即可进行拾取和放下操作的集成解决方案。为了评估它的性能，我们在10个真实的家庭环境中运行OK-Robot。结果表明，OK-Robot在开放式拾取和放下任务中实现了58.5%的成功率，代表了开放词汇移动操作（OVMM）的新水平，性能是先前工作的近1.8倍。在更干净、整洁的环境中，OK-Robot的性能提高到82%。然而，从OK-Robot获得的最重要的见解是，当将VLMs等开放知识系统与机器人模块相结合时，细微细节的关键作用。我们的实验和代码的视频可以在我们的网站上找到：https://ok-robot.github.io

摘要: Remarkable progress has been made in recent years in the fields of vision, language, and robotics. We now have vision models capable of recognizing objects based on language queries, navigation systems that can effectively control mobile systems, and grasping models that can handle a wide range of objects. Despite these advancements, general-purpose applications of robotics still lag behind, even though they rely on these fundamental capabilities of recognition, navigation, and grasping. In this paper, we adopt a systems-first approach to develop a new Open Knowledge-based robotics framework called OK-Robot. By combining Vision-Language Models (VLMs) for object detection, navigation primitives for movement, and grasping primitives for object manipulation, OK-Robot offers an integrated solution for pick-and-drop operations without requiring any training. To evaluate its performance, we run OK-Robot in 10 real-world home environments. The results demonstrate that OK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks, representing a new state-of-the-art in Open Vocabulary Mobile Manipulation (OVMM) with nearly 1.8x the performance of prior work. On cleaner, uncluttered environments, OK-Robot’s performance increases to 82%. However, the most important insight gained from OK-Robot is the critical role of nuanced details when combining Open Knowledge systems like VLMs with robotic modules. Videos of our experiments and code are available on our website: https://ok-robot.github.io

标题: Learned Contextual LiDAR Informed Visual Search in Unseen Environments

作者: Ryan Gupta, Kyle Morgenstein, Steven Ortega

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2309.14150v4

Project: https://sites.google.com/view/lives-2024/home

中文摘要: 本文介绍了LIVES：LiDAR Informed Visual Search，这是一种在未知环境中进行目标搜索的自主规划器。我们考虑基于像素的环境感知问题，其中给定宽视场2D扫描数据，并且必须执行激光雷达分割以上下文标记周围环境中的点。这些像素分类提供了在视觉搜索任务期间计划下一个最佳视点的知情先验。根据使用配备有基于地图的分类器的简单cart平台收集的专家数据来训练地图泛化分类器。自主探索规划器从扫描中获取上下文数据，并在规划更有可能产生搜索目标检测的视点之前使用该数据。为了实现这一点，我们提出了一个效用函数，该函数考虑了传统的度量，如信息增益和路径成本，以及来自扫描分类器的附加上下文信息。LIVES在模拟中以几种现有的探索方法为基准，以验证其性能。最后，在两个看不见的环境中用Spot机器人搜索单个和多个目标的真实实验中验证了该方法。实验验证、实施细节和开放源代码的视频可在我们的项目网站https://sites.google.com/view/lives-2024/home上找到。

摘要: This paper presents LIVES: LiDAR Informed Visual Search, an autonomous planner for target search in unknown environments. We consider the pixel-wise environment perception problem where one is given wide Field of View 2D scan data and must perform LiDAR segmentation to contextually label points in the surroundings. These pixel classifications provide an informed prior on which to plan next best viewpoints during visual search tasks. The map-generalizable classifier is trained from expert data collected using a simple cart platform equipped with a map-based classifier. An autonomous exploration planner takes the contextual data from scans and uses that prior to plan viewpoints more likely to yield detection of the search target. In order to achieve this, we propose a utility function that accounts for traditional metrics like information gain and path cost and also for the additional contextual information from the scan classifier. LIVES is baselined against several existing exploration methods in simulation to verify its performance. Finally, it is validated in real-world experiments searching for single and multiple targets with a Spot robot in two unseen environments. Videos of experimental validation, implementation details and open source code can be found on our project website at https://sites.google.com/view/lives-2024/home.

标题: YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

作者: Chien-Yao Wang, I-Hau Yeh, Hong-Yuan Mark Liao

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2402.13616v2

GitHub: https://github.com/WongKinYiu/yolov9

中文摘要: 当今的深度学习方法主要关注如何设计最合适的目标函数，使模型的预测结果最接近地面真实。同时，必须设计一个适当的架构，以便于获取足够的预测信息。现有方法忽略了一个事实，即当输入数据进行逐层特征提取和空间变换时，会丢失大量信息。本文将深入研究数据在深度网络中传输时数据丢失的重要问题，即信息瓶颈和可逆函数。我们提出了可编程梯度信息（PGI）的概念，以应对深度网络实现多个目标所需的各种变化。PGI可以为目标任务计算目标函数提供完整的输入信息，从而获得可靠的梯度信息来更新网络权重。此外，设计了一种新的基于梯度路径规划的轻量级网络体系结构——广义高效层聚合网络（GELAN）。GELAN的架构证实了PGI在轻量级模型上获得了卓越的结果。我们在基于MS COCO数据集的目标检测上验证了所提出的GELAN和PGI。结果表明，与基于深度卷积开发的最新方法相比，GELAN仅使用传统卷积算子来实现更好的参数利用。PGI可用于从轻型到大型的各种型号。它可以用来获得完整的信息，因此从头开始训练的模型可以比使用大型数据集预训练的最先进的模型获得更好的结果，比较结果如图1所示。源代码位于：https://github.com/WongKinYiu/yolov9。

摘要: Today’s deep learning methods focus on how to design the most appropriate objective functions so that the prediction results of the model can be closest to the ground truth. Meanwhile, an appropriate architecture that can facilitate acquisition of enough information for prediction has to be designed. Existing methods ignore the fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information will be lost. This paper will delve into the important issues of data loss when data is transmitted through deep networks, namely information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to calculate the objective function, so that reliable gradient information can be obtained to update network weights. In addition, a new lightweight network architecture – Generalized Efficient Layer Aggregation Network (GELAN), based on gradient path planning is designed. GELAN’s architecture confirms that PGI has gained superior results on lightweight models. We verified the proposed GELAN and PGI on MS COCO dataset based object detection. The results show that GELAN only uses conventional convolution operators to achieve better parameter utilization than the state-of-the-art methods developed based on depth-wise convolution. PGI can be used for a variety of models from lightweight to large. It can be used to obtain complete information, so that train-from-scratch models can achieve better results than state-of-the-art models pre-trained using large datasets; the comparison results are shown in Figure 1. The source codes are at: https://github.com/WongKinYiu/yolov9.

标题: PolypNextLSTM: A lightweight and fast polyp video segmentation network using ConvNext and ConvLSTM

作者: Debayan Bhattacharya, Konrad Reuter, Finn Behrendt

PubTime: 2024-02-28

Downlink: http://arxiv.org/abs/2402.11585v3

GitHub: https://github.com/mtec-tuhh/PolypNextLSTM

中文摘要: 通常用于息肉分割，单一图像UNet架构缺乏临床医生在诊断息肉时从视频数据中获得的时间洞察力。为了更忠实地反映临床实践，我们提出的解决方案PolypNextLSTM利用基于视频的深度学习，利用时间信息以最少的参数开销实现卓越的分割性能，使其可能适用于边缘设备。PolypNextLSTM采用了一种类似UNet的结构，以ConvNext-Tiny为主干，战略性地省略了最后两层，以减少参数开销。我们的时间融合模块，卷积长短期存储器（ConvLSTM），有效地利用了时间特征。我们的主要新颖性在于PolypNextLSTM，它脱颖而出，成为参数最精简、模型最快的产品，超过了五种最先进的基于图像和视频的深度学习模型的性能。SUN-SEG数据集的评估涵盖了易于检测和难以检测的息肉场景，以及包含快速运动和闭塞等挑战性伪影的视频。与5个基于图像和5个基于视频的模型的比较证明了PolypNextLSTM的优越性，在难以检测的息肉测试集上实现了0.7898的Dice分数，超过了基于图像的PraNet（0.7519）和基于视频的PNSPlusNet（0.7486）。值得注意的是，我们的模型在具有复杂伪影（如重影和遮挡）的视频中表现出色。PolypNextLSTM将修剪的ConvNext-Tiny与ConvLSTM集成用于时间融合，不仅表现出优异的分割性能，而且在评估的模型中保持最高的每速度帧数。访问代码：https://github.com/mtec-tuhh/PolypNextLSTM

摘要: Commonly employed in polyp segmentation, single image UNet architectures lack the temporal insight clinicians gain from video data in diagnosing polyps. To mirror clinical practices more faithfully, our proposed solution, PolypNextLSTM, leverages video-based deep learning, harnessing temporal information for superior segmentation performance with the least parameter overhead, making it possibly suitable for edge devices. PolypNextLSTM employs a UNet-like structure with ConvNext-Tiny as its backbone, strategically omitting the last two layers to reduce parameter overhead. Our temporal fusion module, a Convolutional Long Short Term Memory (ConvLSTM), effectively exploits temporal features. Our primary novelty lies in PolypNextLSTM, which stands out as the leanest in parameters and the fastest model, surpassing the performance of five state-of-the-art image and video-based deep learning models. The evaluation of the SUN-SEG dataset spans easy-to-detect and hard-to-detect polyp scenarios, along with videos containing challenging artefacts like fast motion and occlusion. Comparison against 5 image-based and 5 video-based models demonstrates PolypNextLSTM’s superiority, achieving a Dice score of 0.7898 on the hard-to-detect polyp test set, surpassing image-based PraNet (0.7519) and video-based PNSPlusNet (0.7486). Notably, our model excels in videos featuring complex artefacts such as ghosting and occlusion. PolypNextLSTM, integrating pruned ConvNext-Tiny with ConvLSTM for temporal fusion, not only exhibits superior segmentation performance but also maintains the highest frames per second among evaluated models. The code is available at https://github.com/mtec-tuhh/PolypNextLSTM
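
The Dice score used to compare the segmentation models in this abstract measures the overlap between a predicted mask and the ground truth. A minimal sketch for flat binary masks follows; the masks are toy data, and treating two empty masks as a perfect match is a convention assumed here, not something the abstract specifies.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: counted as perfect agreement (assumed convention)
    return 2.0 * intersection / total

# Toy masks: two of the three predicted pixels overlap the three true pixels.
pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
score = dice_score(pred_mask, true_mask)
```

A score of 1.0 means the masks coincide and 0.0 means they share no foreground pixels, so the reported 0.7898 versus 0.7519 and 0.7486 is a difference in average overlap on the hard-to-detect test set.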

Title: FairSeg: A Large-Scale Medical Image Segmentation Dataset for Fairness Learning Using Segment Anything Model with Fair Error-Bound Scaling

Authors: Yu Tian, Min Shi, Yan Luo

PubTime: 2024-02-27

Downlink: http://arxiv.org/abs/2311.02189v3

Project: https://ophai.hms.harvard.edu/harvard-fairseg10k

GitHub: https://github.com/Harvard-Ophthalmology-AI-Lab/FairSeg

Chinese abstract: 近年来，人工智能模型中的公平性受到越来越多的关注，尤其是在医学领域，因为医学模型的公平性对人们的健康和生命至关重要。推动公平性学习研究需要高质量的医学公平性数据集。现有的医学公平性数据集都面向分类任务，尚无可用于医学分割的公平性数据集，而医学分割是与分类同等重要的临床任务，它可以提供器官异常的详细空间信息，供临床医生评估。在本文中，我们提出了第一个用于医学分割的公平性数据集 Harvard-FairSeg，包含 10,000 个受试者样本。此外，我们提出了一种公平误差界缩放方法，基于 Segment Anything Model（SAM），用每个身份组中的误差上界对损失函数重新加权。我们预计，通过显式处理每个身份组中训练误差较高的困难样本，可以提高分割性能的公平性。为了便于公平比较，我们采用一种新的公平尺度分割性能指标，在公平性背景下比较分割指标，例如公平尺度 Dice 系数。通过全面的实验，我们证明了我们的公平误差界缩放方法与最先进的公平性学习模型相比具有更优或相当的公平性表现。数据集和代码可通过 https://ophai.hms.harvard.edu/harvard-fairseg10k 公开访问。

Abstract: Fairness in artificial intelligence models has gained significantly more attention in recent years, especially in the area of medicine, as fairness in medical models is critical to people’s well-being and lives. High-quality medical fairness datasets are needed to promote fairness learning research. Existing medical fairness datasets are all for classification tasks, and no fairness datasets are available for medical segmentation, even though medical segmentation is as important a clinical task as classification and can provide detailed spatial information on organ abnormalities ready to be assessed by clinicians. In this paper, we propose the first fairness dataset for medical segmentation named Harvard-FairSeg with 10,000 subject samples. In addition, we propose a fair error-bound scaling approach to reweight the loss function with the upper error-bound in each identity group, using the segment anything model (SAM). We anticipate that the segmentation performance equity can be improved by explicitly tackling the hard cases with high training errors in each identity group. To facilitate fair comparisons, we utilize a novel equity-scaled segmentation performance metric to compare segmentation metrics in the context of fairness, such as the equity-scaled Dice coefficient. Through comprehensive experiments, we demonstrate that our fair error-bound scaling approach has either superior or comparable fairness performance to the state-of-the-art fairness learning models. The dataset and code are publicly accessible via https://ophai.hms.harvard.edu/harvard-fairseg10k.
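The abstract names an equity-scaled Dice coefficient but does not define it. As a rough illustration only, the sketch below assumes one plausible form: the overall score divided by one plus the summed absolute gaps between the overall score and each identity group's score. The function name `equity_scaled_score`, the group labels, and the numbers are hypothetical; consult the paper or repository for the authors' actual definition.

```python
from typing import Dict


def equity_scaled_score(overall: float, group_scores: Dict[str, float]) -> float:
    """Scale an overall segmentation score by its disparity across groups.

    ASSUMED form (not taken from the abstract):
        overall / (1 + sum over groups of |overall - group_score|)
    Equal scores across groups leave the overall score unchanged;
    larger gaps between groups shrink it.
    """
    disparity = sum(abs(overall - score) for score in group_scores.values())
    return overall / (1.0 + disparity)


if __name__ == "__main__":
    # Hypothetical Dice scores for two identity groups.
    per_group_dice = {"group_a": 0.85, "group_b": 0.75}
    print(equity_scaled_score(0.80, per_group_dice))
```

Under this assumed form, two models with the same overall Dice are ranked by how evenly that performance is spread across groups.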

Title: Overcoming Dimensional Collapse in Self-supervised Contrastive Learning for Medical Image Segmentation

Authors: Jamshid Hassanpour, Vinkle Srivastav, Didier Mutter

PubTime: 2024-02-27

Downlink: http://arxiv.org/abs/2402.14611v2

Project: https://biomedicalimaging.org/2024/

GitHub: https://github.com/CAMMA-public/med-moco

Chinese abstract: 当标注数据量有限时，自监督学习（SSL）方法取得了巨大成功。在 SSL 中，模型通过解决前置任务（pretext task）来学习鲁棒的特征表示。对比学习就是这样一种前置任务，它构造相似和不相似的输入样本对，引导模型区分它们。在这项工作中，我们研究了对比学习在医学图像分析领域的应用。我们的研究发现，最先进的对比学习方法 MoCo v2 在应用于医学图像时会出现维度坍缩。这归因于医学图像之间高度的图像间相似性。为了解决这个问题，我们提出了两个关键贡献：局部特征学习和特征去相关。局部特征学习提高了模型关注图像局部区域的能力，而特征去相关消除了特征之间的线性相关性。我们的实验结果表明，无论是在线性评估还是完全微调设置下，这两项贡献都显著提升了模型在医学分割下游任务中的性能。这项工作说明了使 SSL 技术有效适应医学成像任务特点的重要性。源代码将在以下网址公开：https://github.com/CAMMA-public/med-moco

Abstract: Self-supervised learning (SSL) approaches have achieved great success when the amount of labeled data is limited. Within SSL, models learn robust feature representations by solving pretext tasks. One such pretext task is contrastive learning, which involves forming pairs of similar and dissimilar input samples, guiding the model to distinguish between them. In this work, we investigate the application of contrastive learning to the domain of medical image analysis. Our findings reveal that MoCo v2, a state-of-the-art contrastive learning method, encounters dimensional collapse when applied to medical images. This is attributed to the high degree of inter-image similarity among medical images. To address this, we propose two key contributions: local feature learning and feature decorrelation. Local feature learning improves the ability of the model to focus on the local regions of the image, while feature decorrelation removes the linear dependence among the features. Our experimental findings demonstrate that our contributions significantly enhance the model’s performance in the downstream task of medical segmentation, both in the linear evaluation and full fine-tuning settings. This work illustrates the importance of effectively adapting SSL techniques to the characteristics of medical imaging tasks. The source code will be made publicly available at: https://github.com/CAMMA-public/med-moco
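The abstract says that feature decorrelation removes linear dependence among features but does not give the loss. The sketch below is a generic decorrelation penalty (the sum of squared off-diagonal entries of the feature correlation matrix), offered as an assumption about what such a term can look like, not as the authors' implementation. The names `correlation_matrix` and `decorrelation_penalty` are hypothetical.

```python
import math
from typing import List


def correlation_matrix(features: List[List[float]]) -> List[List[float]]:
    """Pearson correlation between feature dimensions over a batch.

    `features` is a batch of embeddings: one row per sample, one column
    per feature dimension.
    """
    n = len(features)
    d = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    stds = [math.sqrt(sum(row[j] ** 2 for row in centered) / n) for j in range(d)]
    corr = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            cov = sum(row[i] * row[j] for row in centered) / n
            denom = stds[i] * stds[j]
            # A constant dimension has zero variance; report zero correlation.
            corr[i][j] = cov / denom if denom > 0 else 0.0
    return corr


def decorrelation_penalty(features: List[List[float]]) -> float:
    """Sum of squared off-diagonal correlations (zero when uncorrelated)."""
    corr = correlation_matrix(features)
    d = len(corr)
    return sum(corr[i][j] ** 2 for i in range(d) for j in range(d) if i != j)


if __name__ == "__main__":
    # Second dimension is a multiple of the first: linearly dependent features.
    dependent = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    # The two dimensions vary independently of each other.
    independent = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    print(decorrelation_penalty(dependent))
    print(decorrelation_penalty(independent))
```

Linearly dependent dimensions carry redundant information, which is one symptom of dimensional collapse; a penalty of this kind is high for the first batch and zero for the second.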
