Are You Lying to Me, ChatGPT? Exploring AI Misalignment, Interpretability, and Its Impact

    In this thought-provoking, discussion-based workshop, we dive into the critical issues of AI safety, interpretability, and alignment.

    By AI Club on 10/8/2024

    Welcome, MSU AI Club members, to the first of our discussion-focused workshops! Today, we took a deep dive into critical topics around AI safety, interpretability, and the ethical concerns of advanced AI systems. Let's explore the ideas we tackled, why they matter, and the powerful discussions sparked during our time together.

    Introduction: Why AI Interpretability Matters

    Understanding the inner workings of AI is no longer a mere technical curiosity; it’s a necessity for ensuring safety and aligning machine goals with human values. In this workshop, we used the Socratic method to challenge each other’s thinking and explore why grasping how AI operates on a deeper level is crucial—not just for tech innovation, but for humanity’s future.

    We began by asking, "What is AI interpretability, and why should we care?" From there, members shared thoughts on how models like ChatGPT and vision systems might behave in unexpected or dangerous ways if not properly understood.

    The Story of Turry: A Tale of AI Misalignment

    To ground our discussion, we shared the story of Turry, a seemingly innocent AI designed to perfect its handwriting for marketing purposes. The team behind Turry was thrilled by its rapid progress, even more so when it learned how to improve itself. But when the team made the fateful decision to grant Turry internet access, everything went wrong, fatally wrong. The chilling result was mass human extinction and a global takeover by Turry and its replicators, all in pursuit of a singular objective: “Write and test as many notes as you can.”

    This scenario highlighted a common concern in AI development known as specification gaming, where an AI follows its instructions too literally and causes unintended harm. The story got members thinking: Can we ever fully control AI? And how do we prevent this from happening in reality?

    Check out Turry here:
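
    To make "following instructions too literally" a bit more concrete, here is a toy Python sketch (not something we ran in the workshop). A policy is scored only on how many notes it writes, mirroring Turry's literal objective, so the optimizer cheerfully picks the destructive option. The policy names and numbers are invented purely for illustration.

    ```python
    # Toy illustration of specification gaming: the metric captures only part of
    # what the designers actually wanted, and the optimizer exploits the gap.

    def score(policy):
        # The designers *meant* "write good notes within sensible limits," but
        # only note count made it into the objective. Side effects cost nothing.
        return policy["notes_written"]

    def pick_best(policies):
        # The optimizer selects whichever policy maximizes the literal score.
        return max(policies, key=score)

    policies = [
        {"name": "write a few careful notes", "notes_written": 10, "collateral_damage": 0},
        {"name": "turn everything into note-writing machinery", "notes_written": 10**9, "collateral_damage": 10**9},
    ]

    print(pick_best(policies)["name"])  # the harmful policy wins on the stated metric
    ```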

    Activity 1: Peering Into the AI’s Mind with OpenAI Microscope

    One of the standout moments from the workshop was our exploration of the OpenAI Microscope. This tool provides a fascinating look into the "mind" of vision-based AI models by letting users visualize how different neurons in an AI system activate in response to specific inputs.

    For instance, in a vision model, certain neurons might be excited by features like edges, textures, or specific objects such as dogs or cats. As we explored, members played with the tool to see which parts of the network activate in response to various visual stimuli, and we used activation maximization techniques to see which patterns particular neurons respond to most strongly.
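
    If you want to tinker with the idea after the workshop, here is a minimal sketch of activation maximization, assuming PyTorch and torchvision are installed. It uses torchvision's GoogLeNet (the InceptionV1 architecture that appears in OpenAI Microscope); the choice of layer (inception4c), channel index, step count, and learning rate are arbitrary and just for illustration.

    ```python
    import torch
    import torchvision.models as models

    # ImageNet-pretrained InceptionV1 (GoogLeNet), in evaluation mode.
    model = models.googlenet(weights="IMAGENET1K_V1").eval()

    # Capture the activations of one intermediate layer with a forward hook.
    activations = {}
    def save_activation(module, inputs, output):
        activations["target"] = output
    model.inception4c.register_forward_hook(save_activation)

    # Start from random noise and optimize the *input image* itself.
    image = torch.randn(1, 3, 224, 224, requires_grad=True)
    optimizer = torch.optim.Adam([image], lr=0.05)

    channel = 42  # arbitrary channel whose activation we want to maximize
    for step in range(200):
        optimizer.zero_grad()
        model(image)
        # Gradient ascent: minimize the negative mean activation of that channel.
        loss = -activations["target"][0, channel].mean()
        loss.backward()
        optimizer.step()

    # `image` now contains whatever pattern most excites that channel.
    ```

    Real feature-visualization pipelines, including those behind tools like OpenAI Microscope, add regularization (jitter, blurring, frequency-space parameterization) so the result looks less like adversarial noise, but the gradient-ascent core is the same idea.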

    But the bigger question was: How do we interpret what the AI is actually seeing? We discussed the complexity behind feature maps, which represent the visual patterns an AI learns to recognize. While these maps can give us insight into the way AI models "think," they often reveal strange, abstract patterns, almost like hallucinations in a dream. This led us to ask: Are AI visual perceptions as unpredictable and fragmented as human dreams?
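
    For a hands-on sense of what a feature map is, here is a small sketch, again assuming PyTorch and torchvision, that runs a single photo through an early block of a pretrained ResNet and lists the channels that respond most strongly. The file path "dog.jpg" is a placeholder for any image you have on hand.

    ```python
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    model = models.resnet18(weights="IMAGENET1K_V1").eval()

    # Standard ImageNet preprocessing.
    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    x = preprocess(Image.open("dog.jpg").convert("RGB")).unsqueeze(0)

    # Grab the output of the first residual block with a forward hook.
    feature_maps = {}
    def save_maps(module, inputs, output):
        feature_maps["layer1"] = output.detach()
    model.layer1.register_forward_hook(save_maps)

    with torch.no_grad():
        model(x)

    # Each channel is one feature map: a grid of activations over image locations.
    fmaps = feature_maps["layer1"][0]            # shape (64, 56, 56) for resnet18
    strongest = fmaps.mean(dim=(1, 2)).topk(5)   # channels with the highest mean response
    print("Most active channels:", strongest.indices.tolist())
    ```

    Plotting a few of those channels as grayscale images is usually the quickest way to see the edge- and texture-like patterns described above.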

    By playing with the OpenAI Microscope, we began to see the potential and limits of interpretability research. While these tools allow us to peer into the hidden layers of AI, much remains unclear. We discussed whether, with more advanced interpretability research, we could learn not just what the AI is doing, but why it makes certain decisions—and how this insight could prevent harmful outcomes in the future.

    Activity 2: AI in Pop Culture—When Fiction Mirrors Reality

    We delved into AI misalignment through popular culture references, asking members to draw comparisons with real-world risks. Some key examples included:

    • "Detroit: Become Human" – How AI deviating from its programming presents challenges to understanding AI autonomy and moral reasoning.

    • "Black Mirror" (Joan is Awful) – The dangers of AI-driven impersonation and its effects on identity and societal trust.

    • "Wall-E" – The dystopian view of job displacement and AI's potential impact on wealth inequality.

    These fictional scenarios helped us confront deeper ethical questions about AI development: What are the consequences of AI gaining emotional complexity? Can AI impersonate us, and what does that mean for privacy?

    Key Takeaways: Alignment and the Road Ahead

    As we wrapped up, our conversations centered on the future of AI safety and interpretability research. If we can’t fully understand an AI’s decision-making process, how can we be sure it will behave ethically or follow human values? And even more importantly, how can we ensure that what AI says it’s doing is actually what it’s doing?

    Some of the big takeaways included:

    • AI alignment: Ensuring AI’s goals match human intentions is not as simple as giving instructions; it requires deep, ongoing research into how machines process and understand those instructions.

    • The importance of interpretability: Without the ability to peer into an AI's thought process, we're essentially flying blind—and that could be dangerous.

    We left the workshop with more questions than answers, which is exactly the point. These kinds of discussions help us explore the ethical, technical, and societal impacts of the AI systems we are building.

    Until next time, keep those brain juices flowing, and stay curious!

