Researchers are exploring methods such as causal tracing to probe AI models and adjust their biases, delving into the inner workings of neural networks, particularly large language models, to understand and regulate their complex behaviours.
AIs are often described as 'black boxes', with researchers unable to figure out how they 'think'. To better understand these inscrutable systems, some scientists are borrowing from psychology and neuroscience to design tools that reverse-engineer them, which they hope will lead to safer, more efficient AIs.