The paper itself had kind of a cheeky title: "Your Language Model Is Secretly a Reward Model." And that's hinting at this idea that they found a pretty important mathematical property: you can recover the optimal policy for the language model directly from preferences over different completions. Instead of estimating a separate reward model and training via trial and error, like you said, if you have preference data you can just do a direct, one-step optimization.
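To make that one-step optimization concrete, here is a minimal sketch of the preference loss the paper (Direct Preference Optimization) derives, written in PyTorch. The function and variable names are illustrative, not from the paper's code, and it assumes you have already computed summed log-probabilities for each chosen and rejected completion under both the policy being trained and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO-style preference loss (a sketch, not the authors' exact code).

    Each argument is a batch of summed log-probabilities that the policy
    (or the frozen reference model) assigns to the preferred ("chosen")
    and dispreferred ("rejected") completions.
    """
    # The "secret reward": beta-scaled log-prob ratio between the policy
    # and the reference model, evaluated on each completion.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic loss pushing the chosen completion's implicit reward above
    # the rejected one's -- no separate reward model, no RL rollout loop.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because the reward is expressed implicitly through the policy's own log-probabilities, minimizing this loss with an ordinary gradient step on the preference pairs is the entire training procedure.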