科学研究_上海人工智能实验室

科学研究

Research

首页 > 科学研究

研究方向

人工智能基础理论

开展人工智能前沿基础理论研究，包括机器学习、强化学习、深度学习、知识计算、因果推理、信息安全等；关注人工智能交叉学科研究，探索数据驱动的科学研究新范式。

人工智能开放平台

构建人工智能新型大数据、算法和算力等平台，全面支撑人工智能基础和应用研究。

人工智能基础软件和基础硬件系统

开展人工智能基础软硬件系统的研发，构建技术生态的软硬件基础，包括新一代人工智能训练框架、编程语言、编译器等基础软件，人工智能芯片、传感器等基础硬件。

人工智能应用

探索人工智能技术在城市、交通、医疗、教育、文旅、金融、制造业等行业的应用，关注新领域，开展共性技术平台的研发。

人工智能核心技术

发展新一代人工智能技术，包括计算机视觉、自然语言处理、语音处理、决策智能、智能机器人、城市计算、计算机图形学、数字孪生等。

人工智能伦理与政策

关注人工智能可能引发的经济、社会、伦理、法律、安全、隐私和数据治理等问题，提出解决方案，提供政策参考。

学术成果

发表会议及期刊：arXiv 2024

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

发表会议及期刊：arXiv 2024

Implicit Event-RGBD Neural SLAM

Implicit neural SLAM has achieved remarkable progress recently. Nevertheless, existing methods face significant challenges in non-ideal scenarios, such as motion blur or lighting variation, which often leads to issues like convergence failures, localization drifts, and distorted mapping.

发表会议及期刊：arXiv 2024

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

In this paper, we introduce GS-SLAM that first utilizes 3D Gaussian representation in the Simultaneous Localization and Mapping (SLAM) system. It facilitates a better balance between efficiency and accuracy.

发表会议及期刊：arXiv 2024

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Hallucination, posed as a pervasive challenge of multimodal large language models (MLLMs), has significantly impeded their real-world usage that demands precise judgment. Existing methods mitigate this issue with either training with specific designed data or inferencing with external knowledge from other sources, incurring inevitable additional costs.

发表会议及期刊：CVPR 2024

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements

发表会议及期刊：arXiv 2024

ReZero: Boosting MCTS-based Algorithms by Just-in-Time and Speedy Reanalyze

MCTS-based algorithms, such as MuZero and its derivatives, have achieved widespread success in various decision-making domains. These algorithms employ the reanalyze process to enhance sample efficiency, albeit at the expense of significant wall-clock time consumption.

发表会议及期刊：CVPR 2024

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Despite significant recent progress in the field of autonomous driving, modern methods still struggle and can incur serious accidents when encountering long-tail un foreseen events and challenging urban scenarios. On the one hand, large language models (LLM) have shown impressive reasoning capabilities that approach “Artificial General Intelligence”.

发表会议及期刊：NeurIPS 2024

LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios

Building agents based on tree-search planning capabilities with learned models has achieved remarkable success in classic decision-making problems, such as Go and Atari. However, it has been deemed challenging or even infeasible to extend Monte Carlo Tree Search (MCTS) based algorithms to diverse real-world applications, especially when these environments involve complex action spaces and significant simulation costs, or inherent stochasticity.

发表会议及期刊：arXiv 2024

Geometry-enhanced Pretraining on Interatomic Potentials

Machine learning interatomic potentials (MLIPs) describe the interactions between atoms in materials and molecules by learning them from a reference database generated by ab initio calculations. MLIPs can accurately and efciently predict such interactions and have been applied to various felds of physical science. However, high-performance MLIPs rely on a large amount of labelled data, which are costly to obtain by ab initio calculations.

发表会议及期刊：arXiv

2022

BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision

We present a novel bird’s-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones. Existing state-of-theart BEV detectors are often tied to certain depth pretrained backbones like VoVNet, hindering the synergy between booming image backbones and BEV detectors. To address this limitation, we prioritize easing the optimization of BEV detectors by introducing perspective view supervision. To this end, we propose a two-stage BEV detector, where proposals from the perspective head are fed into the bird’s-eye-view head for final predictions. To evaluate the effectiveness of our model, we conduct extensive ablation studies focusing on the form of supervision and the generality of the proposed detector. The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset. The code shall be released soon.

发表会议及期刊：arXiv

2023

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation

Recent advances in modeling 3D objects mostly rely on synthetic datasets due to the lack of large-scale realscanned 3D databases. To facilitate the development of 3D perception, reconstruction, and generation in the real world, we propose OmniObject3D, a large vocabulary 3D object dataset with massive high-quality real-scanned 3D objects. OmniObject3D has several appealing properties: 1) Large Vocabulary: It comprises 6,000 scanned objects in 190 daily categories, sharing common classes with popBCorresponding authors. https://omniobject3d.github.io/ ular 2D datasets (e.g., ImageNet and LVIS), benefiting the pursuit of generalizable 3D representations. 2) Rich Annotations: Each 3D object is captured with both 2D and 3D sensors, providing textured meshes, point clouds, multiview rendered images, and multiple real-captured videos. 3) Realistic Scans: The professional scanners support highquality object scans with precise shapes and realistic appearances. With the vast exploration space offered by OmniObject3D, we carefully set up four evaluation tracks: a) robust 3D perception, b) novel-view synthesis, c) neural surface reconstruction, and d) 3D object generation. Extensive studies are performed on these four benchmarks, revealing 1 arXiv:2301.07525v2 [cs.CV] 11 Apr 2023 new observations, challenges, and opportunities for future research in realistic 3D vision.

发表会议及期刊：arXiv

2023

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

发表会议及期刊：arXiv

2023

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

We present an interactive visual framework named InternGPT, or iGPT for short. The framework integrates chatbots that have planning and reasoning capabilities, such as ChatGPT, with non-verbal instructions like pointing movements that enable users to directly manipulate images or videos on the screen. Pointing (including gestures, cursors, etc.) movements can provide more flexibility and precision in performing vision-centric tasks that require fine-grained control, editing, and generation of visual content. The name InternGPT stands for interaction, nonverbal, and chatbots. Different from existing interactive systems that rely on pure language, by incorporating pointing instructions, the proposed iGPT significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2. Additionally, in iGPT, an auxiliary control mechanism is used to improve the control capability of LLM, and a large vision-language model termed Husky is fine-tuned for high-quality multi-modal dialogue (impressing ChatGPT-3.5-turbo with 93.89% GPT-4 Quality). We hope this work can spark new ideas and directions for future interactive visual systems.

发表会议及期刊：scientific data

2023

A Large-scale Synthetic Pathological Dataset for Deep Learning-enabled Segmentation of Breast Cancer

The success of training computer-vision models heavily relies on the support of large-scale, real-world images with annotations. Yet such an annotation-ready dataset is difficult to curate in pathology due to the privacy protection and excessive annotation burden. To aid in computational pathology, synthetic data generation, curation, and annotation present a cost-effective means to quickly enable data diversity that is required to boost model performance at different stages. In this study, we introduce a large-scale synthetic pathological image dataset paired with the annotation for nuclei semantic segmentation, termed as Synthetic Nuclei and annOtation Wizard (SNOW). The proposed SNOW is developed via a standardized workflow by applying the off-the-shelf image generator and nuclei annotator. The dataset contains overall 20k image tiles and 1,448,522 annotated nuclei with the CC-BY license. We show that SNOW can be used in both supervised and semi-supervised training scenarios. Extensive results suggest that synthetic-data-trained models are competitive under a variety of model training settings, expanding the scope of better using synthetic images for enhancing downstream data-driven clinical tasks.

发表会议及期刊：arXiv

2023

FengWu: Pushing the Skillful Global Medium-range Weather Forecast Beyond 10 Days Lead

We present FengWu, an advanced data-driven global medium-range weather forecast system based on Artificial Intelligence (AI). Different from existing data-driven weather forecast methods, FengWu solves the medium-range forecast problem from a multi-modal and multi-task perspective. Specifically, a deep learning architecture equipped with model-specific encoder-decoders and cross-modal fusion Transformer is elaborately designed, which is learned under the supervision of an uncertainty loss to balance the optimization of different predictors in a region-adaptive manner. Besides this, a replay buffer mechanism is introduced to improve medium-range forecast performance. With 39-year data training based on the ERA5 reanalysis, FengWu is able to accurately reproduce the atmospheric dynamics and predict the future land and atmosphere states at 37 vertical levels on a 0.25° latitude-longitude resolution. Hindcasts of 6-hourly weather in 2018 based on ERA5 demonstrate that FengWu performs better than GraphCast in predicting 80% of the 880 reported predictands, e.g., reducing the root mean square error (RMSE) of 10-day lead global z500 prediction from 733 to 651 m2/s2. In addition, the inference cost of each iteration is merely 600ms on NVIDIA Tesla A100 hardware. The results suggest that FengWu can significantly improve the forecast skill and extend the skillful global medium-range weather forecast out to 10.75 days lead (with ACC of z500 > 0.6) for the first time.

<<<3 4 5 6 7 >>>

comm@pjlab.org.cn

上海市徐汇区龙文路129号国际传媒港L1楼

沪ICP备2021009351号-1

科学研究

人工智能基础理论

人工智能开放平台

人工智能基础软件和基础硬件系统

人工智能应用

人工智能核心技术

人工智能伦理与政策

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Implicit Event-RGBD Neural SLAM

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

ReZero: Boosting MCTS-based Algorithms by Just-in-Time and Speedy Reanalyze

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios

Geometry-enhanced Pretraining on Interatomic Potentials

BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

A Large-scale Synthetic Pathological Dataset for Deep Learning-enabled Segmentation of Breast Cancer

FengWu: Pushing the Skillful Global Medium-range Weather Forecast Beyond 10 Days Lead

网站地图