The team reviewed existing scholarly articles, conference proceedings, and white papers to gain insight into prevailing trends, persistent challenges, and emerging opportunities.
Sources Used
Computer Vision: 3D & 2D Scene Understanding and Generation
PhoMoH: Implicit Photorealistic 3D Models of Human Heads, https://arxiv.org/pdf/2402.18545, https://arxiv.org/pdf/2212.07275
DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans, https://arxiv.org/pdf/2404.00485, https://openaccess.thecvf.com/content/CVPR2024/papers/Sengupta_DiffHuman_Probabilistic_Photorealistic_3D_Reconstruction_of_Humans_CVPR_2024_paper.pdf
SPHEAR: Spherical Head Registration for Complete Statistical 3D Modeling, https://arxiv.org/pdf/2311.02461
DORSal: Diffusion for Object-centric Representations of Scenes et al., https://openreview.net/pdf?id=3zvB141F6D, https://arxiv.org/pdf/2306.08068
TextMesh: Generation of Realistic 3D Meshes From Text Prompts, https://arxiv.org/pdf/2304.12439
Observations on Synthetic Image Distributions with Stable Diffusion, https://arxiv.org/pdf/2311.00056
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs, https://arxiv.org/pdf/2311.09257
Beyond SOT: Tracking Multiple Generic Objects at Once, https://openaccess.thecvf.com/content/WACV2024/papers/Mayer_Beyond_SOT_Tracking_Multiple_Generic_Objects_at_Once_WACV_2024_paper.pdf
LFM-3D: Learnable Feature Matching Across Wide Baselines Using 3D Signals, https://arxiv.org/pdf/2303.12779
This category consolidates advancements in both 3D and 2D computer vision, including 3D head model generation, human reconstruction, and scene generation from text prompts. Research also explores multi-object tracking, feature matching across wide baselines, and the properties of synthetic image distributions produced with Stable Diffusion.
Foundation Multimodal AI Models: Vision, Language, Speech, and UI Understanding
A visual-language foundation model for computational pathology, https://doi.org/10.1038/s41591-024-02856-4
ScreenAI: A Vision-Language Model for UI and Infographics Understanding, https://arxiv.org/pdf/2402.04615
ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots, https://arxiv.org/html/2209.08199v2
Multimodal Web Navigation with Instruction-Finetuned Foundation Models, https://arxiv.org/pdf/2305.11854
Multimodal Modeling For Spoken Language Identification, https://arxiv.org/pdf/2309.10567
Large Scale Self-Supervised Pretraining for Active Speaker Detection, https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10447899
ChatDirector: Enhancing Video Conferencing with Space-Aware Scene Rendering and Speech-Driven Layout, https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/083d9babac388c1d509a5e259b9cea6c07594f0a.pdf
Towards Generalist Biomedical AI, https://arxiv.org/pdf/2307.14334
This section groups advancements that utilize multiple modalities. It encompasses vision-language models for computational pathology and UI understanding, as well as multimodal approaches to web navigation, spoken language identification, active speaker detection, and enhanced video conferencing.
Language Models: Scaling, Reasoning, and Security
Understanding the Dataset Practitioners Behind Large Language Models, https://arxiv.org/pdf/2402.16611
Conformal Language Modeling, https://arxiv.org/pdf/2306.10193
Chain-of-Table, https://arxiv.org/pdf/2401.04398
Resolving Code Review Comments with ML, https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/0449374e8313796a5e88f92380c64e7e090c6dfa.pdf
Learning to Rewrite Prompts for Personalized Text Generation, https://arxiv.org/pdf/2310.00152
Scaling Data-Constrained Language Models, https://neurips.cc/virtual/2023/poster/70706
Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning, https://arxiv.org/pdf/2211.04325
The Scaling Law of Synthetic Images for Model Training, for Now, https://arxiv.org/pdf/2312.04567
PaLI-X: On Scaling up a Multilingual Vision and Language Model, https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/b2573d2ddfbfc551d216db1ac4668097f0a0f3e1.pdf
Efficient Language Model Architectures for Differentially Private Federated Learning, https://arxiv.org/pdf/2403.08100
Securing the AI Software Supply Chain, https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/b2573d2ddfbfc551d216db1ac4668097f0a0f3e1.pdf
This category focuses on research related to scaling LLMs, improving their reasoning capabilities, and addressing security concerns. Topics range from prompt rewriting, analysis of data limitations, and scaling with synthetic data to multilingual vision-language modeling, differentially private federated learning, and securing the AI software supply chain.
Specialized AI Applications
Solving olympiad geometry without human demonstrations, https://www.nature.com/articles/s41586-023-06747-5
Data Exchange Markets via Utility Balancing, https://arxiv.org/html/2401.13053v1
SEMQA: Semi-Extractive Multi-Source Question Answering, https://arxiv.org/pdf/2311.04886
This category covers projects targeting specific AI applications: AlphaGeometry solves Olympiad geometry problems without human demonstrations, one study designs data exchange markets via utility balancing, and another introduces semi-extractive multi-source question answering (SEMQA).