Abstract: With recent breakthroughs in artificial intelligence, multimodal large models have become a key direction driving AI toward cross-modal, scenario-based understanding and generation. To systematically review the development of this field and identify its technological hotspots and evolutionary trends, this paper examines the current status and future directions of multimodal large models using data from arXiv papers and GitHub open-source projects. The study crawls and formats multimodal-related papers from arXiv together with their core metadata (such as publication dates and keywords), collects data on multimodal-related open-source projects from GitHub, and applies temporal trend analysis, time series forecasting, and topic modeling to outline the hotspots, technological trajectories, and application directions of multimodal research. In total, 2,065 relevant arXiv papers and hundreds of GitHub projects were analyzed. The results trace the developmental pathways and prospects of multimodal models, and future technological trends are forecast using statistical models. The findings indicate that multimodal large models are characterized by integration and diversity, with "modality fusion," "semantic generation," and "multimodal representation" emerging as core research directions. Their application domains are gradually expanding from natural language processing and computer vision to education, healthcare, and other scenarios.