Research
Efficient Multimodal Intelligence Laboratory
Our Vision 🔥
Our lab focuses on developing efficient multimodal (video, image, and language) models to enable always-on, personalized agentic AI that can understand and leverage multimodal data generated from personal devices.
Our long-term vision is to transform real-world multimodal experiences into structured personal memory, bridging perception, memory, and action, and ultimately enabling scalable, on-device Physical AI systems that operate seamlessly in everyday environments.
Research Areas 🚀
We focus on the following research directions:
- IDs (e.g., C16) refer to published papers
Efficient AI/On-Device AI: Efficient inference, efficient LLM reasoning (W2), token/KV cache compression (C15), efficient fine-tuning (C16, C9), and efficient attention & architectures
Multimodal AI: Long video understanding (C15, C18), multimodal embedding & representation (C14), and efficient visual encoders (C12)
Personalized Agentic AI: Agentic memory, personalization (C13), and agentic retrieval-augmented generation (RAG) (C17, W2, C18)
(Multimodal) Small Language Models (SLMs): Developing SLMs that achieve performance comparable to (multimodal) large language models (LLMs) on specific tasks
Additional Areas: Domain adaptation/generalization (C16, C11, C8, C4), test-time adaptation (C10, C9, C7, C6), semantic segmentation (C16, C13, C5, C4, C3, W1), anomaly detection (C10, C5), knowledge distillation (C12), image/video generation (C16, C2) and vision-language-action (VLA) models
Through these broad research directions, we aim to build efficient, practical, and scalable AI systems that can operate in real-world environments, particularly under on-device constraints.
1. Efficient AI/On-Device AI 🪶
1) Efficient Fine-Tuning (C16, C9)
2) Token/KV Cache Compression (C15)
3) Efficient LLM Reasoning (W2)
2. Multimodal AI 📸
1) Multimodal Embedding (C14)
2) Efficient Visual Encoder (C12)
3) Long Video Understanding - Video LLM (C15, P1)
3. Personalized Agentic AI 🧑
1) Personalization (C13)
2) (Multimodal) Retrieval-Augmented Generation (P1, C17)
3) Agentic Memory (TBA)
4. Additional Areas 🖼️
1) Domain Generalization/Adaptation (C16, C11, C8, C4)
2) Test-Time Adaptation (C10, C9, C7, C6)
3) Anomaly Detection (C10, C5)
4) Urban-Scene Segmentation / Lane Detection (C16, C13, C5, C4, C3, W1)
Research Motivation 💡
Our lab aims to develop always-on personalized AI that can effectively understand and use multimodal data generated from personal devices. To this end, we study resource-efficient multimodal (video, image, and language) models that can operate continuously in real-world settings.
As Agentic AI becomes increasingly integrated into everyday devices, it evolves toward truly personalized systems by understanding what users see, hear, say, and do. In this paradigm, a key challenge is to efficiently accumulate large-scale multimodal data generated over time, structure it into personalized memory, and leverage it for meaningful interaction.
Furthermore, AI is transitioning from standalone generative models to agentic systems capable of planning, reasoning, and acting. These systems break their work into multiple sub-tasks, and repeatedly invoking an LLM for every sub-task is inefficient in both cost and computation. Achieving sustainable AI therefore requires maximizing the performance of task-specialized, resource-efficient SLMs.
Motivated by these challenges, we aim to extend multimodal understanding toward real-world interaction, and ultimately realize efficient on-device Physical AI systems that can operate seamlessly in everyday environments.
Bridging Research and Real-World AI 🌍
Translating research into real-world AI systems
Enabling on-device AI under practical constraints (e.g., power, compute), with inherent privacy and cost advantages
Deploying AI in real-world domains (e.g., Automotive, Robotics, IoT)