ECCC Report TR24-098
https://eccc.weizmann.ac.il/report/2024/098
Comments and Revisions published for TR24-098
Fri, 07 Jun 2024 02:18:31 +0300
Revision 2
| Models That Prove Their Own Correctness |
Noga Amit,
Shafi Goldwasser,
Orr Paradise,
Guy Rothblum
https://eccc.weizmann.ac.il/report/2024/098#revision2
How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured *on average* over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically-founded solution to this problem: to train *Self-Proving models* that prove the correctness of their output to a verification algorithm V via an Interactive Proof.
Self-Proving models satisfy that, with high probability over an input sampled from a given distribution, the model generates a correct output *and* successfully proves its correctness to V. The *soundness* property of V guarantees that, for *every* input, no model can convince V of the correctness of an incorrect output. Thus, a Self-Proving model proves correctness of most of its outputs, while *all* incorrect outputs (of any model) are detected by V. We devise a generic method for learning Self-Proving models, and prove its convergence under certain assumptions.
The theoretical framework and results are complemented by experiments on an arithmetic capability: computing the greatest common divisor (GCD) of two integers. Our learning method is used to train a Self-Proving transformer that computes the GCD *and* proves the correctness of its answer.
Fri, 07 Jun 2024 02:18:31 +0300
https://eccc.weizmann.ac.il/report/2024/098#revision2
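To make the verifier's role concrete, here is a minimal sketch (not the paper's protocol, which is interactive) of a sound, non-interactive check for GCD claims: the prover supplies Bezout coefficients x, y with a*x + b*y = d as a certificate, and the verifier accepts only if d divides both inputs and the Bezout identity holds — conditions that together force d = gcd(a, b). The function names and the positive-integer restriction are illustrative assumptions.

```python
def prove_gcd(a: int, b: int):
    """Prover: extended Euclid gives d = gcd(a, b) plus a Bezout
    certificate (x, y) with a*x + b*y = d. Assumes a, b > 0."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, (old_x, old_y)  # d and its certificate

def verify_gcd(a: int, b: int, d: int, cert) -> bool:
    """Verifier: sound for every input -- no certificate can make it
    accept an incorrect d. If d divides a and b, and d is a Z-linear
    combination of a and b, then every common divisor of a and b
    divides d, so d = gcd(a, b)."""
    x, y = cert
    return d > 0 and a % d == 0 and b % d == 0 and a * x + b * y == d
```

For example, `prove_gcd(12, 18)` yields d = 6 with a certificate that `verify_gcd` accepts, while any claimed value other than the true GCD is rejected regardless of the certificate offered.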
Revision 1
| Models That Prove Their Own Correctness |
Noga Amit,
Shafi Goldwasser,
Orr Paradise,
Guy Rothblum
https://eccc.weizmann.ac.il/report/2024/098#revision1
How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured *on average* over a distribution of inputs, giving no guarantee for any fixed input.
This paper proposes a theoretically-founded solution to this problem: to train *Self-Proving models* that prove the correctness of their output to
a verification algorithm $V$ via an Interactive Proof.
Self-Proving models satisfy that, with high probability over an input sampled from a given distribution, the model generates a correct output *and* successfully proves its correctness to $V\!$. The *soundness* property of $V$ guarantees that, for *every* input, no model can convince $V$ of the correctness of an incorrect output. Thus, a
Self-Proving model proves correctness of most of its outputs, while *all* incorrect outputs (of any model) are detected by $V$. We devise a generic method for learning Self-Proving models, and prove its convergence under certain assumptions.
The theoretical framework and results are complemented by experiments on an arithmetic capability: computing the greatest common divisor (GCD) of two integers. Our learning method is used to train a Self-Proving transformer that computes the GCD *and* proves the correctness of its answer.
Fri, 07 Jun 2024 02:15:15 +0300
https://eccc.weizmann.ac.il/report/2024/098#revision1
Paper TR24-098
| Models That Prove Their Own Correctness |
Noga Amit,
Shafi Goldwasser,
Orr Paradise,
Guy Rothblum
https://eccc.weizmann.ac.il/report/2024/098
How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured $on\ average$ over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically-founded solution to this problem: to train $Self$-$Proving\ models$ that prove the correctness of their output to a verification algorithm $V$ via an Interactive Proof.
Self-Proving models satisfy that, with high probability over a random input, the model generates a correct output $and$ successfully proves its correctness to $V\!$. The $soundness$ property of $V$ guarantees that, for $every$ input, no model can convince $V$ of the correctness of an incorrect output. Thus, a Self-Proving model proves correctness of most of its outputs, while $all$ incorrect outputs (of any model) are detected by $V$. We devise a generic method for learning Self-Proving models, and we prove convergence bounds under certain assumptions.
The theoretical framework and results are complemented by experiments on an arithmetic capability: computing the greatest common divisor (GCD) of two integers. Our learning method is used to train a Self-Proving transformer that computes the GCD $and$ proves the correctness of its answer.
Sun, 02 Jun 2024 15:10:26 +0300
https://eccc.weizmann.ac.il/report/2024/098