diff --git a/2023/2023_11_06_Representation_Engineering/README.md b/2023/2023_11_06_Representation_Engineering/README.md
new file mode 100644
index 0000000..cbdbac2
--- /dev/null
+++ b/2023/2023_11_06_Representation_Engineering/README.md
@@ -0,0 +1,5 @@
+# Representation Engineering
+
+AI systems raise many challenges around transparency, safety, ethics, and fairness. How can we understand and control the inner workings of AI systems, especially large language models (LLMs) with remarkable capabilities across diverse domains? In this seminar, we explore the emerging area of representation engineering (RepE), a top-down approach to AI transparency that places representations at the center of analysis. We will cover the methods and applications of RepE, including reading and controlling representations of concepts and functions in LLMs, and how RepE can address safety-relevant problems such as honesty, hallucination, utility, power-seeking, emotion, harmlessness, and bias. We will also discuss the limitations and future directions of RepE and how it can contribute to more transparent and trustworthy AI systems.
+
+Paper: https://arxiv.org/abs/2310.01405
\ No newline at end of file
diff --git a/2023/2023_11_06_Representation_Engineering/Representation Engineering.pdf b/2023/2023_11_06_Representation_Engineering/Representation Engineering.pdf
new file mode 100644
index 0000000..e66caff
Binary files /dev/null and b/2023/2023_11_06_Representation_Engineering/Representation Engineering.pdf differ
diff --git a/README.md b/README.md
index d81a09d..6d8dfcc 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ Join us at https://meet.drwhy.ai.
 * 16.10.2023 - [On Minimizing the Impact of Dataset Shifts on Actionable Explanations](https://github.com/MI2DataLab/MI2DataLab_Seminarium/blob/master/2023/2023_10_16_impact_of_dataset_shifts_on_actionable_eplanations.txt) - Hubert Baniecki
 * 23.10.2023 - [On the Robustness of Removal-Based Feature Attributions](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2023/2023_10_23_removal_based_attributions_robustness) - Mateusz Krzyziński
 * 30.10.2023 - Discussion - AdvXAI: Robustness of explanations
-* 06.11.2023 - Introduction to RedTeaming - Maciej Chrabąszcz
+* 06.11.2023 - [Representation Engineering](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2023/2023_11_06_Representation_Engineering) - Maciej Chrabąszcz
 * 13.11.2023 - [Adaptive Testing of Computer Vision Models](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2023/2023_11_13_Adaptive_Testing_of_Computer_Vision_Models) - Mikołaj Spytek
 * 20.11.2023 - [Red Teaming Language Models with Language Models](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2023/2023_11_20_Red_Teaming_Language_Models_with_Language_Models) - Piotr Wilczyński
 * 27.11.2023 - [Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2023/2023_11_27_Red_Teaming_Language_Models_to_Reduce_Harms) - Vladimir Zaigrajew