You've proposed a very difficult version of the Turing test. It's to assess the system over several hours, with a human judge who's fairly sophisticated about what computers can and can't do. So you really want the human to challenge the system on its ability to do things like common-sense reasoning, perhaps. That's actually a key problem with large language models.