Single-Head ViT;Faster Whisper;Transformer KF;Pick-and-Draw

本文首发于公众号:机器感知

Single-Head ViT;Faster Whisper;Transformer KF;Pick-and-Draw

SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

图片

Recently, efficient Vision Transformers have shown great performance with low latency on resource-constrained devices. Conventionally, they use 4x4 patch embeddings and a 4-stage structure at the macro level, while utilizing sophisticated attention with multi-head configuration at the micro level. This paper aims to address computational redundancy at all design levels in a memory-efficient manner. We discover that using larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance by leveraging token representations with reduced spatial redundancy from the early stages. Furthermore, our preliminary analyses suggest that attention layers in the early stages can be substituted with convolutions, and several attention heads in the latter stages are computationally redundant. To handle this, we introduce a single-head attention module that inherently prevents head redundancy and simultaneously boosts accuracy by parallelly combining global and local information. Building upon our solutions, we introduce SHViT, a Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy tradeoff. For example, on ImageNet-1k, our SHViT-S4 is 3.3x, 8.1x, and 2.4x faster than MobileViTv2 x1.0 on GPU, CPU, and iPhone12 mobile device, respectively, while being 1.3% more accurate. For object detection and instance segmentation on MS COCO using Mask-RCNN head, our model achieves performance comparable to FastViT-SA12 while exhibiting 3.8x and 2.0x lower backbone latency on GPU and mobile device, respectively.

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on  E-Branchformer

图片

Recent studies have advocated for fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data and open-source toolkits. With the aim of reproducing Whisper, the previous OWSM v1 through v3 models were still based on Transformer, which might lead to inferior performance compared to other state-of-the-art speech encoders. In this work, we aim to improve the performance and efficiency of OWSM without extra training data. We present E-Branchformer based OWSM v3.1 models at two scales, i.e., 100M and 1B. The 1B model is the largest E-Branchformer based speech model that has been made publicly available. It outperforms the previous OWSM v3 in a vast majority of evaluation benchmarks, while demonstrating up to 25% faster inference speed.

OptiState: State Estimation of Legged Robots using Gated Networks with Transformer-based Vision and Kalman Filtering

图片

State estimation for legged robots is challenging due to their highly dynamic motion and limitations imposed by sensor accuracy. By integrating Kalman filtering, optimization, and learning-based modalities, we propose a hybrid solution that combines proprioception and exteroceptive information for estimating the state of the robot's trunk. Leveraging joint encoder and IMU measurements, our Kalman filter is enhanced through a single-rigid body model that incorporates ground reaction force control outputs from convex Model Predictive Control optimization. The estimation is further refined through Gated Recurrent Units, which also considers semantic insights and robot height from a Vision Transformer autoencoder applied on depth images. This framework not only furnishes accurate robot state estimates, including uncertainty evaluations, but can minimize the nonlinear errors that arise from sensor measurements and model simplifications through learning. The proposed methodology is evaluated in hardware using a quadruped robot on various terrains, yielding a 65% improvement on the Root Mean Squared Error compared to our VIO SLAM baseline. Code example: https://github.com/AlexS28/OptiState

Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

图片

Diffusion-based text-to-image personalization have achieved great success in generating subjects specified by users among various contexts. Even though, existing finetuning-based methods still suffer from model overfitting, which greatly harms the generative diversity, especially when given subject images are few. To this end, we propose Pick-and-Draw, a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods. Our approach consists of two components: appearance picking guidance and layout drawing guidance. As for the former, we construct an appearance palette with visual features from the reference image, where we pick local patterns for generating the specified subject with consistent identity. As for layout drawing, we outline the subject's contour by referring to a generative template from the vanilla diffusion model, and inherit the strong image prior to synthesize diverse contexts according to different text conditions. The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.

BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion

图片

Witnessing the evolution of text-to-image diffusion models, significant strides have been made in text-to-3D generation. Currently, two primary paradigms dominate the field of text-to-3D: the feed-forward generation solutions, capable of swiftly producing 3D assets but often yielding coarse results, and the Score Distillation Sampling (SDS) based solutions, known for generating high-fidelity 3D assets albeit at a slower pace. The synergistic integration of these methods holds substantial promise for advancing 3D generation techniques. In this paper, we present BoostDream, a highly efficient plug-and-play 3D refining method designed to transform coarse 3D assets into high-quality. The BoostDream framework comprises three distinct processes: (1) We introduce 3D model distillation that fits differentiable representations from the 3D assets obtained through feed-forward generation. (2) A novel multi-view SDS loss is designed, which utilizes a multi-view aware 2D diffusion model to refine the 3D assets. (3) We propose to use prompt and multi-view consistent normal maps as guidance in refinement.Our extensive experiment is conducted on different differentiable 3D representations, revealing that BoostDream excels in generating high-quality 3D assets rapidly, overcoming the Janus problem compared to conventional SDS-based methods.

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.rhkb.cn/news/248894.html

如若内容造成侵权/违法违规/事实不符,请联系长河编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

【遥感专题系列】遥感影像信息提取之——人工目视解译

​遥感影像通过亮度值或像元值的高低差异(反映地物的光谱信息)及空间变化(反映地物的空间信息)来表示不同地物的差异,这是区分不同影像地物的物理基础。 ​人工解译是目前国内使用最多的一种影像提取方法,如…

算法练习-螺旋矩阵(思路+流程图+代码)

难度参考 难度:中等 分类:数组 难度与分类由我所参与的培训课程提供,但需要注意的是,难度与分类仅供参考。以下内容均为个人笔记,旨在督促自己认真学习。 题目 给定一个正整数n,生成一个包含1到 n^2 所有元…

API网关-Apisix RPM包方式自动化安装配置教程

文章目录 前言一、简介1. etcd简介2. APISIX简介3. apisix-dashboard简介 二、Apisix安装教程1. 复制脚本2. 增加执行权限3. 执行脚本4. 浏览器访问5. 卸载Apisix 三、命令1. Apisix命令1.1 启动apisix服务1.2 停止apisix服务1.3 优雅地停止apisix服务1.4 重启apisix服务1.5 重…

Android矩阵Matrix变换setRectToRect,Kotlin

Android矩阵Matrix变换setRectToRect,Kotlin 在 Android画布Canvas裁剪区域clipRect,Kotlin-CSDN博客 基础上,增加一个点,通过setRectToRect挖出Bitmap原图中心区域的一块放到目标RectF里面。 import android.content.Context imp…

Google Chrome 常用的几个参数

1 右键--Google Chrome--属性--目标 参数作用--disable-infobars此计算机将不会再收到 Google Chrome 更新,因为 Windows XP 和 Windows Vista 不再受支持。适用于 xp、2003 的 49.x.x.x 版本。示例1--ingore-certificate-errors忽略证书错误--disable-background-…

知识库建设这些工具来帮忙,企业工作效率翻倍

在当今深度信息化的年代,知识库成了企业不可或缺的一部分,它的建设与管理显得格外重要。然而,想要建设又好又高效的知识库并非易事。好消息是,有很多优秀的工具可以让这个过程变得更加轻松,今天我们就重点来探讨其中的…

excel中去掉单元格中两个数字之间的空格

excel中去掉单元格中两个数字之间的空格 使用公式:SUBSTITUTE(A1," “,”") 解释:将A1单元格中的空格查找出来并去掉。

搭建WebGL开发环境

前言 本篇文章介绍如何搭建WebGL开发环境 WebGL WebGL的技术规范继承自免费和开源的OpenGL ES标准,从某种意义上说,WebGL就是Web版的OpenGL ES,而OpenGL ES是从OpenGL中派生出来的。他们的应用环境有区别,一般来说:…

C++:第十四讲动态规划初步

每日C知识 想要在做C小游戏里实现等待效果,可以用Sleep。 Sleep函数可以使计算机程序(进程,任务或线程)进入休眠,使其在一段时间内处于非活动状态。 一般需要头文件windows.h。 注意"Sleep"首字母要大写…

ppt背景图片怎么设置?让你的演示更加出彩!

PowerPoint是一款广泛应用于演示文稿制作的软件,而背景图片是演示文稿中不可或缺的一部分。一个好的背景图片能够提升演示文稿的整体效果,使观众更加关注你的演示内容。可是ppt背景图片怎么设置呢?本文将介绍ppt背景图片设置的三个方法&#…

Spring-boot项目+Rancher6.3部署+Nacos配置中心+Rureka注册中心+Harbor镜像仓库+NFS存储

目录 一、项目概述二、环境三、部署流程3.1 Harbor部署3.1.1 docker安装3.1.2 docker-compose安装3.1.3 安装证书3.1.4 Harbor下载配置安装 3.2 NFS存储搭建3.3 Rancher平台配置3.3.1 NFS存储相关配置3.3.2 Harbor相关配置3.3.3 Nacos部署及相关配置3.3.4 工作负载deployment配…

ubuntu20.04 安装ROS2 记录

主要参考B站古月居的ROS2入门21讲 和 以下链接(基本和视频上一致) ubuntu20.04安装ROS2 详细教程_ubuntu20.04 ros2-CSDN博客 但是中间有些需要注意的地方, 1,添加源 步骤中提到 sudo curl -sSL https://raw.githubuserconten…

Redis冲冲冲——缓存三兄弟:缓存击穿、穿透、雪崩

目录 引出缓存击穿缓存穿透缓存雪崩 总结 引出 谈谈redis的击穿、穿透、雪崩。 缓存击穿 缓存击穿:redis中没有,但是数据库有 顺序:先查缓存,判断缓存是否存在;如果缓存存在,直接返回数据;如果…

【前端素材】bootstrap3 实现地产置业公司source网页设计

一、需求分析 地产置业公司的网页通常是该公司的官方网站,旨在向访问者提供相关信息和服务。这些网页通常具有以下功能: 公司介绍:网页通常包含有关公司背景、历史、核心价值观和使命等方面的信息。此部分帮助访问者了解公司的身份和目标。 …

人工视觉仍然需要图像采集卡

最初,图像采集卡被用作模拟视频数字转换器和图像缓冲器,但如今它们能够执行复杂的任务,例如图像处理。图像采集卡的设计不断发展,旨在提高系统性能并减少计算机处理需求。 除了图像采集之外,图像采集卡还执行机器视觉…

芯片跨时钟域设计(二)

芯片培训(真实项目)介绍: 低功耗景芯SoC前端、中端、后端全流程实战培训(火爆订购中) DDR4/3项目实战培训 ARM Cortex-A72处理器12nm PR实战培训 ARM Cortex-A72处理器12nm DFT实战培训 ARM Cortex-A7处理器28nm PR实战培训(火爆价格战) RISC-V MCU 40nm全芯片…

抽象类(Java)、模板方法设计模式

一、概念 在Java中有abstract关键字,就是抽象的意思,可用来修饰类和成员方法。 用abstract来修饰类,那这个类就是抽象类;修饰方法,那这个方法就是抽象方法。 修饰符 abstract class 类名{修饰符 abstract 返回值类型…

走进水稻种植教学基地可视化:科技与农业知识的完美结合

随着科技的不断发展,农业领域也在不断创新和进步。水稻种植教学基地可视化系统是一种基于现代信息技术手段的教学方式,通过虚拟现实、3D建模等技术,将水稻种植的全过程进行模拟和展示。这种教学方式打破了传统农业教学的局限性,使…

Springboot+vue的健身房管理系统(有报告)。Javaee项目,springboot vue前后端分离项目

演示视频: Springbootvue的健身房管理系统(有报告)。Javaee项目,springboot vue前后端分离项目 项目介绍: 本文设计了一个基于Springbootvue的前后端分离的健身房管理系统,采用M(model&#xf…

知识点积累系列(一)golang语言篇【持续更新】

云原生学习路线导航页(持续更新中) 本文是 知识点积累 系列文章的第一篇,记录golang语言相关的知识点 1.结构体的mapstructure是什么 mapstructure:"default" mapstructure是一个Go语言的库,用于将一个map中的值映射到…