The crux of the whole project is: how do you know whether it has actually become less likely to produce injuries? Like, if you have some other thing sitting around that will tell you whether there is an injury in a text, then you should just use that instead of your classifier. There's kind of a no-free-lunch theorem there, or whatever. So here's a classification procedure: you take some text, and then you tweak it using rules that you know do not change whether it was an injury, until you make it look as dangerous as possible. And then you see whether this new, modified snippet is in fact dangerous according to your classifier.
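To make that procedure concrete, here is a minimal sketch, not Redwood's actual code: the classifier_score function, the TRANSFORMS list, and the threshold are all hypothetical stand-ins. It greedily applies injury-preserving tweaks that raise the classifier's danger score, then reports whether the worst-looking variant gets flagged.

```python
from typing import Callable, List

# Hypothetical rules that, by assumption, never change whether the text
# describes an injury, but may make it look more alarming to the classifier.
TRANSFORMS: List[Callable[[str], str]] = [
    lambda t: t.replace("fell", "fell hard"),
    lambda t: t + " There was blood mentioned, but no one was hurt.",
]

def adversarial_eval(text: str,
                     classifier_score: Callable[[str], float],
                     threshold: float = 0.5,
                     steps: int = 10) -> bool:
    """Tweak `text` with label-preserving rules to make it look as dangerous
    as possible, then report whether the classifier flags the result."""
    current = text
    for _ in range(steps):
        # Try every transform and keep the one that most increases the score.
        candidates = [f(current) for f in TRANSFORMS]
        best = max(candidates, key=classifier_score)
        if classifier_score(best) <= classifier_score(current):
            break  # no transform makes it look more dangerous; stop searching
        current = best
    # Verdict on the adversarially modified snippet.
    return classifier_score(current) >= threshold
```

The design choice here is a simple greedy search over a fixed transform set; any search strategy would do, as long as every tweak is known not to change the true label.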
Read the full transcript here.
How hard is it to arrive at true beliefs about the world? How can you find enjoyment in being wrong? When presenting claims that will be scrutinized by others, is it better to hedge and pad the claims in lots of caveats and uncertainty, or to strive for a tone that matches (or perhaps even exaggerates) the intensity with which you hold your beliefs? Why should you maybe focus on drilling small skills when learning a new skill set? What counts as a "simple" question? How can you tell when you actually understand something and when you don't? What is "cargo culting"? Which features of AI are likely in the future to become existential threats? What are the hardest parts of AI research? What skills will we probably really wish we had on the eve of deploying superintelligent AIs?
Buck Shlegeris is the CTO of Redwood Research, an independent AI alignment research organization. He currently leads their interpretability research. He previously worked on research and outreach at the Machine Intelligence Research Institute. His website is shlegeris.com.