Papers We Love Milano: Inside the Mind of an LLM

Name: Papers We Love Milano: Inside the Mind of an LLM
Start: 2024-10-10T19:00:00+02:00
End: 2024-10-10T21:00:00+02:00
Location: eDreams ODIGEO Tech Hub

Schedule

Thu Oct 10 2024 at 07:00 pm to 09:00 pm

UTC+02:00

Location

eDreams ODIGEO Tech Hub | Milano, LO

Emanuele Fabbiano of Xtream joins us again for a new talk about the internals of LLMs!

About this Event

Elevator Pitch

Recent advances in Large Language Model (LLM) explainability have yielded intriguing results. This talk will explore the most recent breakthroughs: how researchers discovered that Llamas “think” in English, and how they altered Claude’s belief system, inducing it to give the highest importance to the Golden Gate Bridge. Finally, we'll examine the implications of these findings for AI privacy and security.

Abstract

In 2022 and 2023, explaining the inner workings of large language models (LLMs) seemed like a daunting task. However, recent studies in 2024 have led to significant breakthroughs in this area. This talk will explore three of the most important discoveries in LLM explainability.

1. Llamas “Think” in English [1]

Researchers from EPFL have revealed that models from the Llama 2 family of LLMs use English as their internal representation, regardless of the input or output language. This behaviour may account for certain biases in the models' style when used with non-English languages.

2. Monosemantic Features in Claude 3 and GPT-4 [2]

Researchers from Anthropic, later followed by OpenAI, have succeeded in decomposing the internal representations of Claude 3 Sonnet and GPT-4 into monosemantic features, each of which corresponds to a single interpretable concept. This discovery enables a deeper understanding of which parts of the model are associated with specific topics and allows the relative importance of these topics to be adjusted. This technique holds promise for aligning LLMs with ethical values.
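To make the idea of adjusting a topic's importance more concrete, here is a minimal toy sketch in Python. It is not the method used by Anthropic or OpenAI: the feature names, the three-dimensional activation vector, and the helper functions `encode`, `decode`, and `steer` are all hypothetical, and the hand-picked orthogonal directions stand in for features that would in practice be learned from real model activations.

```python
# Toy illustration only: hand-picked orthogonal directions stand in for
# features that would be learned from real model activations (assumption).
FEATURES = {
    "golden_gate_bridge": [1.0, 0.0, 0.0],
    "cooking": [0.0, 1.0, 0.0],
    "python_code": [0.0, 0.0, 1.0],
}


def encode(activation):
    """Project an activation onto each feature direction, clipping negatives to 0."""
    return {
        name: max(0.0, sum(a * d for a, d in zip(activation, direction)))
        for name, direction in FEATURES.items()
    }


def decode(strengths):
    """Rebuild an activation as the strength-weighted sum of feature directions."""
    dim = len(next(iter(FEATURES.values())))
    rebuilt = [0.0] * dim
    for name, strength in strengths.items():
        for i, d in enumerate(FEATURES[name]):
            rebuilt[i] += strength * d
    return rebuilt


def steer(activation, feature, value):
    """Clamp one feature to a chosen strength and return the edited activation."""
    strengths = encode(activation)
    strengths[feature] = value
    return decode(strengths)


activation = [0.1, 0.7, 0.2]  # hypothetical activation, mostly "cooking"
steered = steer(activation, "golden_gate_bridge", 10.0)
print(encode(activation))
print(steered)
```

In this sketch, clamping the hypothetical "golden_gate_bridge" feature to a large value makes it dominate the rebuilt activation while the other features keep their original strengths, which is the intuition behind changing how much weight a model gives to a topic.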

3. LLMs Memorize Unusual Data [3, 4]

Recent research has shown why LLMs tend to memorize specific samples that are outliers compared to the normal data distribution. Unique strings, such as names and personal information, are particularly prone to memorization and can be reproduced by the models, especially when exploring similarly unusual data spaces. This finding explains instances where GPT-4 has shared personal information when prompted to repeat the same word indefinitely.

Finally, we will discuss the implications of these findings on privacy and security in the context of LLMs.

Time Split

Talk Structure

0-5 mins: Introduction - Why LLMs are challenging to explain
5-10 mins: Llamas “Think” in English
10-25 mins: Monosemantic Features in Claude 3 and GPT-4
25-35 mins: LLMs Memorize Unusual Data
35-50 mins: Conclusion - Impact on security and privacy

Resources

[1] Wendler, Chris, et al. "Do Llamas Work in English? On the Latent Language of Multilingual Transformers." arXiv preprint arXiv:2402.10588 (2024).
[2] Templeton, Adly, et al. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic, 2024.
[3] Jarmul, Katharine. Practical Data Privacy. O'Reilly Media, Inc., 2023.
[4] Jarmul, Katharine. "Your Model Probably Memorized the Training Data." PyData Berlin, 2024.