SenseTime releases open-source model 'Intern 2.5' for autonomous driving and robotics

March 14, 2023 11:19 pm

Chinese artificial intelligence company SenseTime has released a large multimodal, multitask universal model called "Intern 2.5". With 3 billion parameters, it achieves the highest ImageNet accuracy among the world's open-source models and is the only model to exceed 65.0 mAP on the COCO object detection benchmark, according to the company. Its cross-modal, open-task processing ability can provide efficient and accurate perception and understanding for general scenarios such as autonomous driving and robotics.

“Intern” was first released in November 2021 by SenseTime, the Shanghai AI Laboratory, Tsinghua University, the Chinese University of Hong Kong, and Shanghai Jiaotong University. It has since been co-developed by these institutions.

As of today, the Intern 2.5 multimodal universal large model has been open-sourced on OpenGVLab, a general visual open-source platform that SenseTime participates in.

As demand for new applications grows rapidly, traditional computer vision models struggle to keep up with the countless specific tasks of the real world. Intern 2.5, a higher-level visual system with universal scene perception and complex problem-solving capabilities, addresses this by defining tasks through text, making it possible to flexibly specify the requirements of different scenarios. Given a visual image and a task prompt, it can produce instructions or answers, supporting advanced perception and complex problem-solving in general scenarios such as image description, visual question-answering, visual reasoning, and text recognition.
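The idea of selecting a task through free-form text, rather than training one model per task, can be sketched in miniature. The snippet below is a toy illustration of that interface pattern, not SenseTime's actual API; the handler names and outputs are hypothetical stubs.

```python
# Toy sketch of a text-defined task interface (hypothetical, not Intern 2.5's API):
# one entry point routes a free-text prompt to the matching vision task.

def run_task(prompt: str, image_id: str) -> str:
    """Route a text prompt to a task handler; all handlers are made-up stubs."""
    handlers = {
        "describe": lambda img: f"caption for {img}",       # image description
        "detect": lambda img: f"bounding boxes in {img}",   # object detection
        "read text": lambda img: f"OCR output for {img}",   # text recognition
    }
    for keyword, handler in handlers.items():
        if keyword in prompt.lower():
            return handler(image_id)
    return "unsupported task"

print(run_task("Describe this scene", "street.jpg"))  # → caption for street.jpg
```

A real multimodal model replaces the keyword table with learned language understanding, so new task phrasings need no code changes.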

Intern 2.5 can assist with complex tasks in general scenarios such as autonomous driving and home robots. In autonomous driving, it can significantly improve scene perception and understanding, helping vehicles accurately judge traffic-signal states, road signs, and other information, and providing effective input for vehicle decision-making and planning.

Beyond complex problems such as autonomous driving, the Intern 2.5 universal large model can also handle common everyday tasks, such as generating high-quality, naturalistic images from users' creative prompts using a diffusion-based generation algorithm.
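Diffusion models in general work by gradually adding Gaussian noise to an image and training a network to reverse that process. The snippet below sketches only the standard forward (noising) step common to DDPM-style methods, as a point of reference; it is not Intern 2.5's actual algorithm, and the schedule and sizes are illustrative assumptions.

```python
import numpy as np

# Generic DDPM-style forward noising step (a sketch of the standard technique,
# not Intern 2.5's implementation): x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*noise.

def forward_diffuse(x0: np.ndarray, t: int, betas: np.ndarray, rng) -> np.ndarray:
    """Sample x_t from q(x_t | x_0) for timestep t."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])      # cumulative signal retention
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)              # a common linear schedule
image = rng.standard_normal((8, 8))                # stand-in for a real image
noisy = forward_diffuse(image, t=999, betas=betas, rng=rng)
```

Image generation then runs this process in reverse: starting from pure noise, a trained network removes a little noise at each step until a clean image emerges.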

Intern 2.5 can also quickly retrieve visual content from text. For example, it can return the images in a photo album that match a text query, or find the frames in a video most relevant to a text description, improving the efficiency of temporal localization tasks in video.
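Text-to-image retrieval of this kind is typically done by embedding both the text query and the images into a shared vector space and ranking by cosine similarity. The sketch below shows that general CLIP-style idea with made-up toy vectors; the embeddings are assumptions, not outputs of Intern 2.5.

```python
import numpy as np

# Generic embedding-based retrieval sketch (the vectors are invented for
# illustration; a real system would get them from a multimodal encoder).

def retrieve(text_vec: np.ndarray, image_vecs: np.ndarray) -> int:
    """Return the index of the image embedding closest to the text embedding."""
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = normalize(image_vecs) @ normalize(text_vec)  # cosine similarities
    return int(np.argmax(sims))

# Hypothetical embeddings: three "images" and one "text query".
images = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.1, 0.9])
print(retrieve(query, images))  # → 1
```

The same ranking applies to video: embed each frame once, then a text query becomes a single similarity search over frame vectors, which is what makes temporal localization fast.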