{"id":832,"date":"2025-02-04T20:57:29","date_gmt":"2025-02-04T19:57:29","guid":{"rendered":"https:\/\/idea.gm.th-koeln.de\/?p=832"},"modified":"2025-02-04T20:57:32","modified_gmt":"2025-02-04T19:57:32","slug":"iclr-2024-review-by-jens-brandt","status":"publish","type":"post","link":"https:\/\/www.spotseven.de\/?p=832","title":{"rendered":"ICLR 2024 Review by Jens Brandt"},"content":{"rendered":"\n<p>The year 2024 brought numerous exciting conferences \u2013 one of them was the ICLR (<a href=\"https:\/\/iclr.cc\/Conferences\/2024\">International Conference on Learning Representations<\/a>) in Vienna. In this article, Jens Brandt, a PhD student at the IDE+A Institute, shares his impressions of the conference, exciting research highlights, and personal experiences from the world of machine learning.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>ICLR 2024 Primer<\/strong><\/h2>\n\n\n\n<p><strong>Dominant Keywords:<\/strong><\/p>\n\n\n\n<p>&#8211; Large Language Model.  Reinforcement Learning, Diffusion Model, Graph Neural Network, Generative Model, Interpretability<\/p>\n\n\n\n<p><strong>KPIs<\/strong><\/p>\n\n\n\n<p>&#8211; <strong>7262<\/strong> Submissions, <strong>2260<\/strong> Accepted&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/p>\n\n\n\n<p>&#8211; <strong>6533<\/strong> Total Attendees<\/p>\n\n\n\n<p>&#8211; <strong>8950<\/strong> 
Reviewers&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/p>\n\n\n\n<p>&#8211; <strong>1647<\/strong> USA <strong>814<\/strong> China <strong>494<\/strong> Germany<\/p>\n\n\n\n<p>&#8211; <strong>41.1%<\/strong> of all Submissions Written with LLM Support&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/p>\n\n\n\n<p><strong>Summary<\/strong><\/p>\n\n\n\n<p>This year\u2019s ICLR was dominated by research in the field of large language models and visual- and multi-modal foundation models. A lot of research happened towards in-context learning, fine-tuning strategies, model size reduction and mechanistical understanding of the underlying dynamics in large foundational models. Besides that, Reinforcement Learning, Data Centric ML and AI for Science (e.g. PINNs) were also very dominant in the poster sessions.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p><strong>Keynote Highlights<\/strong><\/p>\n\n\n\n<p>In her keynote talk &#8211; <strong>Why your work matters for climate in more ways than you think<\/strong> \u2013 Priya L. Donti (Assistant Professor MIT, Co-Founder and Chair, Climate Change AI) elaborated on the influence of the AI community on the climate. 
AI applications, ranging from forecasting models to enhancing machine efficiency, offer promising avenues for mitigating and understanding climate change. In contrast, there are also AI applications that increase emissions, e.g. efficiency improvements in the oil industry. Gains in hardware efficiency, which in the past kept pace with the demand for computing power, have slowed down, while in recent years the demand for compute (including for AI applications) has kept growing. Moreover, the systemic societal impacts of AI continue to expand. Highlighting a disparity between the prevailing AI paradigm and the reality on the ground, she argued that this reality is often characterized by little data that is difficult to move, limited computing power (e.g. edge devices), and the need to save energy. Methodological frontiers critical for addressing climate concerns include physics-informed ML, safe and robust ML, interpretable ML, uncertainty quantification, generalization, causality, energy-efficient ML, TinyML, and AutoML.<\/p>\n\n\n\n<p>In her keynote talk &#8211; <strong>Learning through AI\u2019s winters and springs: unexpected truths on the road to AGI<\/strong> \u2013 Raia Hadsell (Senior Director of Research and Robotics at DeepMind) reflected on the past decades of AI research. She argued that the heyday of Reinforcement Learning is over, since learning from scratch is extremely challenging and no sufficient level of generalization currently seems achievable. She sees the biggest chance of breaking through those hard walls in deploying many agents in real-world environments instead of simulations, since reality is much more complex and noisier than any simulation. Furthermore, she discussed whether models should be multitudes or monoliths and highlighted the advantages of multitudes with respect to distributed training on differing hardware (See DiLoCo and DiPaCo). Lastly, she talked about the great possibilities in AI for Science (e.g. 
GenCast).<\/p>\n\n\n\n<p>In his keynote talk &#8211; <strong>The ChatGLM&#8217;s Road to AGI<\/strong> \u2013 Tang Jie (Professor Tsinghua University) presented a suite of models from Zhipu AI that matches the models offered by OpenAI. He highlighted that most of their models are open source and elaborated on their idea of an AGI and some findings regarding emergent abilities.<\/p>\n\n\n\n<p>In his keynote talk \u2013 <strong>The emerging science of benchmarks<\/strong> \u2013 Moritz Hardt (Director Max Planck Institute for Intelligent Systems) analyzed the dynamics of publicly available benchmarks that are used extensively in the research community. He showed that the vault assumption doesn\u2019t hold on those test sets because there is a closed feedback loop between the ML community and the benchmarks, which in theory should reduce the life span of such benchmark sets. Luckily, the competitive nature of those benchmarks seems to work as some kind of regularization and \u201crecovers\u201d the vault assumption. Furthermore, he showed that the extensive data cleaning and curating of ImageNet was probably not necessary [1]. Finally, he discussed the polymorphic era of benchmarks, multi-task benchmarks, concerns about static benchmarks, and the ambitious plans to develop dynamic benchmarks that are continuously adversarially extended.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Test of Time Award<\/strong><\/h2>\n\n\n\n<p>For the ICLR Test of Time Award, the Program Chairs examined papers from ICLR 2013 &amp; 2014 and looked for ones with long-lasting impact. The winner is <strong>Auto-Encoding Variational Bayes<\/strong> by Diederik Kingma and Max Welling [2]. This paper gave rise to the variational autoencoder (VAE) and the integration of deep learning with scalable probabilistic inference. 
[3]<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Research Highlights<\/strong><\/h2>\n\n\n\n<p>Many works investigated how the size of a model can be reduced without losing much performance in the process. Pruning approaches, which try to remove less important weights from a model, and initialization approaches, which try to initialize a small model with weights derived from a trained big model, seemed especially promising [4-8]. The work of A. Bair et al. stood out in particular because it proposes a pruning method that tries to find a sparse model that does not lose its out-of-distribution robustness (See Fig. 1) [9].<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"172\" data-attachment-id=\"833\" data-permalink=\"https:\/\/www.spotseven.de\/?attachment_id=833\" data-orig-file=\"https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild1.png?fit=1048%2C176&amp;ssl=1\" data-orig-size=\"1048,176\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Bild1\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild1.png?fit=1024%2C172&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/idea.gm.th-koeln.de\/wp-content\/uploads\/2025\/02\/Bild1-1024x172.png?resize=1024%2C172&#038;ssl=1\" alt=\"\" class=\"wp-image-833\" srcset=\"https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild1.png?resize=1024%2C172&amp;ssl=1 1024w, 
https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild1.png?resize=750%2C126&amp;ssl=1 750w, https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild1.png?resize=500%2C84&amp;ssl=1 500w, https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild1.png?resize=768%2C129&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild1.png?resize=720%2C121&amp;ssl=1 720w, https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild1.png?resize=560%2C94&amp;ssl=1 560w, https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild1.png?w=1048&amp;ssl=1 1048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 1: AdaSAP is a three-step process that takes as input a dense pretrained model and outputs a sparse robust model. [9]<\/figcaption><\/figure>\n\n\n\n<p class=\"has-medium-font-size\">My favorite paper and presentation of the conference, <strong>Vision Transformers Need Registers<\/strong> by T. Darcet et al., investigated artifacts in the attention maps of vision transformers, especially high-norm tokens in low-informative background areas. The authors assume that those tokens are used by the model to store some kind of global information for internal computation. To address this, the simple fix is to append additional register tokens to the input that serve as storage for this global information. The work of M. Sun et al. investigated this phenomenon as well (See Fig. 2). They analyzed what happens when the resulting high activations are removed during inference and found that the magnitude of the activations is important, but not their exact value. They therefore interpreted these activations as some kind of learned bias. To address this issue, which might cause stability problems in training, they found an even simpler fix that does not require appending additional tokens. 
[10,11]<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"246\" data-attachment-id=\"834\" data-permalink=\"https:\/\/www.spotseven.de\/?attachment_id=834\" data-orig-file=\"https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild2.png?fit=1048%2C252&amp;ssl=1\" data-orig-size=\"1048,252\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Bild2\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild2.png?fit=1024%2C246&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/idea.gm.th-koeln.de\/wp-content\/uploads\/2025\/02\/Bild2-1024x246.png?resize=1024%2C246&#038;ssl=1\" alt=\"\" class=\"wp-image-834\" srcset=\"https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild2.png?resize=1024%2C246&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild2.png?resize=750%2C180&amp;ssl=1 750w, https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild2.png?resize=500%2C120&amp;ssl=1 500w, https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild2.png?resize=768%2C185&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild2.png?resize=720%2C173&amp;ssl=1 720w, https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild2.png?resize=560%2C135&amp;ssl=1 560w, https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/Bild2.png?w=1048&amp;ssl=1 1048w\" 
sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Activation Magnitudes (z-axis) in LLaMA2-7B. [11]<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>2nd Workshop on Mathematical and Empirical Understanding of Foundation Models<\/strong><\/h2>\n\n\n\n<p>The workshop kicked off with a talk by Sasha Rush (Associate Professor Cornell University, Hugging Face). He contrasted new sequence architectures like Mamba or xLSTM against transformers. The biggest benefit of the former lies in their faster, less memory-heavy computation, which especially comes in handy when training models with very long context lengths. After that, he presented MambaByte, a language model that works without tokenization (on byte level), which would not be possible with transformer models due to the quadratic increase of compute with sequence length. Lastly, he introduced DiffuSSM, an SSM for image generation.<\/p>\n\n\n\n<p>After a talk by Yuandong Tian (Meta, FAIR) about recent findings with respect to the self-attention mechanism, Hannaneh Hajishirzi (University of Washington) presented OLMo, a state-of-the-art, truly open LLM and framework with a corresponding cleaned training dataset. If you are interested in all the details of training modern LLMs, this work is for you. There were further interesting presentations about the inner workings of transformer models in the afternoon. If you are curious, I can provide some more material on this, but my train ride home is about to end soon, so I will stop here.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Sources<\/strong><\/h2>\n\n\n\n<p>1. <a href=\"https:\/\/arxiv.org\/abs\/2404.02112\">https:\/\/arxiv.org\/abs\/2404.02112<\/a><\/p>\n\n\n\n<p>2. 
<a href=\"https:\/\/arxiv.org\/abs\/1312.6114\">https:\/\/arxiv.org\/abs\/1312.6114<\/a><\/p>\n\n\n\n<p>3.<a href=\" https:\/\/blog.iclr.cc\/2024\/05\/07\/iclr-2024-test-of-time-award\/\"> https:\/\/blog.iclr.cc\/2024\/05\/07\/iclr-2024-test-of-time-award\/<\/a><\/p>\n\n\n\n<p>4. <a href=\"https:\/\/arxiv.org\/pdf\/2306.11695\">https:\/\/arxiv.org\/pdf\/2306.11695<\/a><\/p>\n\n\n\n<p>5. <a href=\"https:\/\/openreview.net\/pdf?id=dyrGMhicMw\">https:\/\/openreview.net\/pdf?id=dyrGMhicMw<\/a><\/p>\n\n\n\n<p>6. <a href=\"https:\/\/openreview.net\/pdf?id=ldJXXxPE0L\">https:\/\/openreview.net\/pdf?id=ldJXXxPE0L<\/a><\/p>\n\n\n\n<p>7. <a href=\"https:\/\/openreview.net\/pdf?id=kOBkxFRKTA\">https:\/\/openreview.net\/pdf?id=kOBkxFRKTA<\/a><\/p>\n\n\n\n<p>8. <a href=\"https:\/\/arxiv.org\/abs\/2303.04947\">https:\/\/arxiv.org\/abs\/2303.04947<\/a><\/p>\n\n\n\n<p>9. <a href=\"https:\/\/openreview.net\/forum?id=QFYVVwiAM8\">https:\/\/openreview.net\/forum?id=QFYVVwiAM8<\/a><\/p>\n\n\n\n<p>10. <a href=\"https:\/\/arxiv.org\/abs\/2309.16588\">https:\/\/arxiv.org\/abs\/2309.16588<\/a><\/p>\n\n\n\n<p>11.<a href=\" https:\/\/openreview.net\/forum?id=1ayU4fMqme\"> https:\/\/openreview.net\/forum?id=1ayU4fMqme<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The year 2024 brought numerous exciting conferences \u2013 one of them was the ICLR (International Conference on Learning Representations) in &hellip; <a class=\"more-link\" 
href=\"https:\/\/www.spotseven.de\/?p=832\">More<\/a><\/p>\n","protected":false},"author":3,"featured_media":838,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-832","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized",""],"acf":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/www.spotseven.de\/wp-content\/uploads\/2025\/02\/iclr2024.jpg?fit=1320%2C970&ssl=1","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"jetpack-related-posts":[],"jetpack_shortlink":"https:\/\/wp.me\/p2DCPK-dq","_links":{"self":[{"href":"https:\/\/www.spotseven.de\/index.php?rest_route=\/wp\/v2\/posts\/832","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.spotseven.de\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.spotseven.de\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.spotseven.de\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.spotseven.de\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=832"}],"version-history":[{"count":6,"href":"https:\/\/www.spotseven.de\/index.php?rest_route=\/wp\/v2\/posts\/832\/revisions"}],"predecessor-version":[{"id":841,"href":"https:\/\/www.spotseven.de\/index.php?rest_route=\/wp\/v2\/posts\/832\/revisions\/841"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.spotseven.de\/index.
php?rest_route=\/wp\/v2\/media\/838"}],"wp:attachment":[{"href":"https:\/\/www.spotseven.de\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=832"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.spotseven.de\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=832"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.spotseven.de\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=832"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}