CameraCtrl、EDTalk、Sketch3D、Diffusion^2、FashionEngine

本文首发于公众号:机器感知

CameraCtrl、EDTalk、Sketch3D、Diffusion^2、FashionEngine

NVINS: Robust Visual Inertial Navigation Fused with NeRF-augmented  Camera Pose Regressor and Uncertainty Quantification

图片

In recent years, Neural Radiance Fields (NeRF) have emerged as a powerful tool for 3D reconstruction and novel view synthesis. However, the computational cost of NeRF rendering and degradation in quality due to the presence of artifacts pose significant challenges for its application in real-time and robust robotic tasks, especially on embedded systems. This paper introduces a novel framework that integrates NeRF-derived localization information with Visual-Inertial Odometry(VIO) to provide a robust solution for robotic navigation in a real-time. By training an absolute pose regression network with augmented image data rendered from a NeRF and quantifying its uncertainty, our approach effectively counters positional drift and enhances system reliability. We also establish a mathematically sound foundation for combining visual inertial navigation with camera localization neural networks, considering uncertainty under a Bayesian framework. Experimental validation in the photore......

Is Model Collapse Inevitable? Breaking the Curse of Recursion by  Accumulating Real and Synthetic Data

图片

The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops discovered that such loops can lead to model collapse, a phenomenon where performance progressively degrades with each model-fitting iteration until the latest model becomes useless. However, several recent papers studying model collapse assumed that new data replace old data over time rather than assuming data accumulate over time. In this paper, we compare these two settings and show that accumulating data prevents model collapse. We begin by studying an analytically tractable setup in which a sequence of linear models are fit to the previous models' predictions. Previous work showed if data are replaced, the test error increases linearly with the number of model-fitting iterations; we extend this result by proving that if data instead acc......

Octopus: On-device language model for function calling of software APIs

图片

In the rapidly evolving domain of artificial intelligence, Large Language Models (LLMs) play a crucial role due to their advanced text processing and generation abilities. This study introduces a new strategy aimed at harnessing on-device LLMs in invoking software APIs. We meticulously compile a dataset derived from software API documentation and apply fine-tuning to LLMs with capacities of 2B, 3B and 7B parameters, specifically to enhance their proficiency in software API interactions. Our approach concentrates on refining the models' grasp of API structures and syntax, significantly enhancing the accuracy of API function calls. Additionally, we propose \textit{conditional masking} techniques to ensure outputs in the desired formats and reduce error rates while maintaining inference speeds. We also propose a novel benchmark designed to evaluate the effectiveness of LLMs in API interactions, establishing a foundation for subsequent research. Octopus, the fine-tuned model, is ......

EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

图片

Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the application and entertainment of the talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved to share with different modal input, both aspects often neglected in existing methods. To address this gap, this paper proposes a novel Efficient Disentanglement framework for Talking head generation (EDTalk). Our framework enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on video or audio inputs. Specifically, we employ three lightweight modules to decompose the facial dynamics into three distinct latent spaces representing mouth, pose, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions.......

FashionEngine: Interactive Generation and Editing of 3D Clothed Humans

图片

We present FashionEngine, an interactive 3D human generation and editing system that allows us to design 3D digital humans in a way that aligns with how humans interact with the world, such as natural languages, visual perceptions, and hand-drawing. FashionEngine automates the 3D human production with three key components: 1) A pre-trained 3D human diffusion model that learns to model 3D humans in a semantic UV latent space from 2D image training data, which provides strong priors for diverse generation and editing tasks. 2) Multimodality-UV Space encoding the texture appearance, shape topology, and textual semantics of human clothing in a canonical UV-aligned space, which faithfully aligns the user multimodal inputs with the implicit UV latent space for controllable 3D human editing. The multimodality-UV space is shared across different user inputs, such as texts, images, and sketches, which enables various joint multimodal editing tasks. 3) Multimodality-UV Aligned Sampler ......

GEARS: Local Geometry-aware Hand-object Interaction Synthesis

图片

Generating realistic hand motion sequences in interaction with objects has gained increasing attention with the growing interest in digital humans. Prior work has illustrated the effectiveness of employing occupancy-based or distance-based virtual sensors to extract hand-object interaction features. Nonetheless, these methods show limited generalizability across object categories, shapes and sizes. We hypothesize that this is due to two reasons: 1) the limited expressiveness of employed virtual sensors, and 2) scarcity of available training data. To tackle this challenge, we introduce a novel joint-centered sensor designed to reason about local object geometry near potential interaction regions. The sensor queries for object surface points in the neighbourhood of each hand joint. As an important step towards mitigating the learning complexity, we transform the points from global frame to hand template frame and use a shared module to process sensor features of each individual......

Sketch3D: Style-Consistent Guidance for Sketch-to-3D Generation

图片

Recently, image-to-3D approaches have achieved significant results with a natural image as input. However, it is not always possible to access these enriched color input samples in practical applications, where only sketches are available. Existing sketch-to-3D researches suffer from limitations in broad applications due to the challenges of lacking color information and multi-view content. To overcome them, this paper proposes a novel generation paradigm Sketch3D to generate realistic 3D assets with shape aligned with the input sketch and color matching the textual description. Concretely, Sketch3D first instantiates the given sketch in the reference image through the shape-preserving generation process. Second, the reference image is leveraged to deduce a coarse 3D Gaussian prior, and multi-view style-consistent guidance images are generated based on the renderings of the 3D Gaussians. Finally, three strategies are designed to optimize 3D Gaussians, i.e., structural optimiz......

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

图片

Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed to describe complex human movements with crucial appearance information. 2) Gestures and speech exhibit inherent dependencies and should be temporally aligned even of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features preserving essential appearance information. Then a transformer-based diffusion model is proposed to learn the temporal correlation between gestures and speech, and performs gener......

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

图片

Controllability plays a crucial role in video generation since it allows users to create desired content. However, existing models largely overlooked the precise control of camera pose that serves as a cinematic language to express deeper narrative nuances. To alleviate this issue, we introduce CameraCtrl, enabling accurate camera pose control for text-to-video(T2V) models. After precisely parameterizing the camera trajectory, a plug-and-play camera module is then trained on a T2V model, leaving others untouched. Additionally, a comprehensive study on the effect of various datasets is also conducted, suggesting that videos with diverse camera distribution and similar appearances indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise and domain-adaptive camera control, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs. Our proje......

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of  Orthogonal Diffusion Models

图片

Recent advancements in 3D generation are predominantly propelled by improvements in 3D-aware image diffusion models which are pretrained on Internet-scale image data and fine-tuned on massive 3D data, offering the capability of producing highly consistent multi-view images. However, due to the scarcity of synchronized multi-view video data, it is impractical to adapt this paradigm to 4D generation directly. Despite that, the available video and 3D data are adequate for training video and multi-view diffusion models that can provide satisfactory dynamic and geometric priors respectively. In this paper, we present Diffusion$^2$, a novel framework for dynamic 3D content creation that leverages the knowledge about geometric consistency and temporal smoothness from these models to directly sample dense multi-view and multi-frame images which can be employed to optimize continuous 4D representation. Specifically, we design a simple yet effective denoising strategy via score composi......

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.rhkb.cn/news/310057.html

如若内容造成侵权/违法违规/事实不符,请联系长河编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

【电控笔记0】稳定度判断

简要概括 现控:远离虚轴,稳定度越高 自控:相位裕度PM 增益裕度GM 开环传函 不稳定条件判断

15 - Debian如何配置共享服务Samba

作者:网络傅老师 特别提示:未经作者允许,不得转载任何内容。违者必究! Debian如何配置共享服务Samba 《傅老师Debian小知识库系列之15》——原创 前言 傅老师Debian小知识库特点: 1、最小化拆解Debian实用技能&…

Groovy程序设计-【第一部分Groovy起步】-01-起步

前言: 知识点记录来源于【Groovy程序设计】一书中,本文仅作知识点记录供日后使用查询,不做教程使用。 1.安装Groovy 安装非常简单,百度一下很多教程,安装过JDK的都懂。 查看安装的groovy的版本: groov…

总结SQL相对常用的几个字符函数

目录 字符的截取 substr() trim()、ltrim()、rtrim() 字符串的拼接 ||、 字符的大小写转换 upper(column_name):大写 lower(column_name):小写 字符替换 replace() 搜索字符 instr(column_name, substring_to_find,start,n_appearence) charindex(substring_to_fi…

自然语言处理、大语言模型相关名词整理

自然语言处理相关名词整理 零样本学习(zero-shot learning)词嵌入(Embedding)为什么 Embedding 搜索比基于词频搜索效果好? Word2VecTransformer检索增强生成(RAG)幻觉采样温度Top-kTop-p奖励模…

通过WebShell登录SQL Server主机并使用SSRS报表服务

背景信息 RDS SQL Server提供了WebShell功能,允许用户通过Web界面登录到RDS SQL Server实例的操作系统中,并在该操作系统中执行命令、上传下载文件等操作。WebShell功能方便用户对RDS SQL Server实例的管理和维护,特别是在无法使用SSH客户端的…

Win10系统下的EDGE浏览器启用IE模式

Win10系统下的EDGE浏览器目前已弃用IE内核,这样在访问某些较老的网站会有兼容性问题,本文记录了在EDGE浏览器中启用IE模式的操作方法。 一、启用EDGE浏览器的IE模式 要打开Internet Explorer模式,执行以下步骤: 1、在Microsoft Edge的地址栏…

物联网SaaS平台

在信息化、智能化浪潮席卷全球的今天,物联网SaaS平台作为推动工业数字化转型的重要工具,正日益受到广泛关注。那么,物联网SaaS平台究竟是什么?HiWoo Cloud作为物联网SaaS平台又有哪些独特优势?更重要的是,它…

基于 StarRocks 的风控实时特征探索和实践

编者荐语: 金融风控特征在实时业务中至关重要,是评估和管理风险的核心指标。经过评估,滴滴最终选择了 StarRocks 作为验证选项的落地方案。通过 StarRocks 实现流批一体,成功解决了风控实时特征流批分离的难题,缩短了开…

Java虚拟机——内存的分配详解

内存区域划分 对于大多数的程序员来说,Java 内存比较流行的说法便是堆和栈,这其实是非常粗略的一种划分,这种划分的“堆”对应内存模型的 Java 堆,“栈”是指虚拟机栈,然而 Java 内存模型远比这更复杂,想深…

Xxl-job执行器自动注册不上的问题

今天新建的项目要部署xxl-job,之前部署过好多次,最近没怎么部署,生疏了。部署完之后,服务一直没有注册到执行器管理里面,找了半天也没找到原因,看数据库里的xxl_job_registry表也是一直有数据进来。 后来看…

鸿蒙 Failed :entry:default@CompileResource...

Failed :entry:defaultCompileResource... media 文件夹下有文件夹或者图片名称包含中文字符 rawfile 文件夹下文件名称、图片名称不能包含中文字符

GIS 数据格式转换

1、在线工具 mapshaper 2、数据上传 3、数据格式转换 导入数据可导出为多种格式:Shapefile、Json、GeoJson、CSV、TopJSON、KML、SVG

第十五届蓝桥杯大赛软件赛省赛 C/C++ 大学 B 组

试题 C: 好数 时间限制 : 1.0s 内存限制: 256.0MB 本题总分:10 分 【问题描述】 一个整数如果按从低位到高位的顺序,奇数位(个位、百位、万位 )上 的数字是奇数,偶数位(十位、千位、十万位 &…

海外KOL推广:情感链接策略助力品牌口碑与忠诚度提升

在当今全球化的市场环境下,品牌在海外市场的推广已经成为提升竞争力和拓展业务的关键。与此同时,海外KOL的影响力也日益凸显,他们不仅仅是产品的推荐者,更是品牌与目标市场受众之间建立情感链接的关键角色。本文Nox聚星将和大家探…

使用阿里云试用Elasticsearch学习:Search Labs Tutorials 搭建一个flask搜索应用

文档:https://www.elastic.co/search-labs/tutorials/search-tutorial https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/search-tutorial Full-Text Search

Unity 中消息提醒框

Tooltip 用于ui布局 using System.Collections; using System.Collections.Generic; using UnityEngine; using TMPro; using UnityEngine.UI;[ExecuteInEditMode()] // 可以在编辑模式下运行public class Tooltip : MonoBehaviour {public TMP_Text header; // 头部文本publi…

QT_day3

完善对话框,点击登录对话框,如果账号和密码匹配,则弹出信息对话框,给出提示”登录成功“,提供一个Ok按钮,用户点击Ok后,关闭登录界面,跳转到其他界面 如果账号和密码不匹配&#xf…

第十届 蓝桥杯 单片机设计与开发项目 省赛

第十届 蓝桥杯 单片机设计与开发项目 省赛 输入: 频率信号输入模拟电压输入 输出(包含各种显示功能): LED显示SEG显示DAC输出 01 数码管显示问题:数据类型 bit Seg_Disp_Mode;//0-频率显示界面 1-电压显示界面 un…

安卓玩机工具推荐----MTK 高通芯片机型 免权限刷机 备份基带 去除锁类工具操作步骤解析

今天为友友解析一款手机维修的工具_PL,它可以刷写高通芯片 mtk芯片固件。可以备份mtk基带分区和恢复基带分区。带mtk刷写免权限。可以去除一些机型的用户锁【例如用户忘记手机锁屏密码类】以及去除机型的FRP锁等等 工具对于私人用户遇到一些手机故障 例如忘记密码锁…