The constraints are the number of characters, the variable kerning between character pairs, and any spacing or line-break decisions made by the word-processing software.
Use of a variable-width font massively prunes the search space, since far fewer strings will match the redaction box's length pixel-perfectly.
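A minimal sketch of this width filter, assuming the document's font and point size are known and using Pillow's kerning-aware ImageFont.getlength(); the font path, box width, and tolerance below are illustrative assumptions, not measured values:

```python
# Prune candidate strings by rendered pixel width.
from PIL import ImageFont

FONT = ImageFont.truetype("TimesNewRoman.ttf", size=12)  # assumed font/size
BOX_WIDTH_PX = 57.0   # measured width of the redaction box (assumption)
TOLERANCE_PX = 0.5    # allow for rasterisation / measurement noise

def width_px(s: str) -> float:
    # getlength() applies the font's kerning tables, so "AV" measures
    # narrower than the widths of "A" and "V" summed separately.
    return FONT.getlength(s)

def plausible(candidate: str) -> bool:
    return abs(width_px(candidate) - BOX_WIDTH_PX) <= TOLERANCE_PX

candidates = ["John", "Jane", "Joan"]   # e.g. from a wordlist or an LLM
matches = [c for c in candidates if plausible(c)]
```

With a fixed-width font every string of the right character count passes this filter; with a variable-width font, the kerning-dependent widths eliminate most of them.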
Suggesting candidate solutions requires a forward model of the string-to-pixel transformation and a probabilistic model that assigns probabilities to candidate redacted strings. The forward model requires domain knowledge of the word-processing software's kerning and layout process, and should also account for the redaction process itself (do the boxes exactly cover the redacted text, and is there noise in their lengths?). The probabilistic model conditions on the known text and predicts the redacted text; a transformer (a large language model, with or without contextual fine-tuning) would be perfect for this.
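A minimal sketch of the probabilistic side, assuming the Hugging Face transformers library, with GPT-2 standing in for whatever model is actually used; the prefix and candidate names are invented for illustration:

```python
# Score candidate fill-ins by their log-probability under a causal LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(prefix: str, candidate: str) -> float:
    """Summed log-probability of `candidate` given the visible `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    full_ids = tok(prefix + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(full_ids).logits
    # Log-probability of each token given all preceding tokens.
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    cand_positions = range(prefix_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(logp[i, full_ids[0, i + 1]].item() for i in cand_positions)

# Candidates carry a leading space so the prefix tokenisation is unchanged
# (a simplification; robust code would align the two tokenisations).
prefix = "The payment was authorised by"
for name in [" John Smith", " Jane Doe"]:
    print(name, log_prob(prefix, name))
```

The same scoring function works for any candidate that survives the width filter, so the two models compose directly.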
If we guess some redactions and use them to predict other redactions, the search space grows combinatorially, but the search is highly parallelisable. Moreover, evaluating the joint probability of matches across multiple redactions greatly reduces the probability of false matches, making any discovered solutions far more convincing.
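A sketch of the joint search as a beam search over the boxes; the candidate lists and the scoring function here are stand-ins (in practice the candidates would come from the width filter and the score from the language model):

```python
# Beam search over several redaction boxes, keeping the jointly most
# probable partial assignments at each step.
REDACTIONS = 2                       # number of boxes in the document
CANDIDATES = [["John", "Joan"],      # width-feasible candidates per box,
              ["Smith", "Swift"]]    # e.g. output of the pixel-width filter
BEAM_WIDTH = 4

def joint_log_prob(fills: tuple[str, ...]) -> float:
    # Stand-in: in practice, substitute the fills into the document and
    # score the whole passage with the language model.
    return -sum(len(f) for f in fills) * 0.1

beam: list[tuple[float, tuple[str, ...]]] = [(0.0, ())]
for box in range(REDACTIONS):
    expanded = [(joint_log_prob(fills + (c,)), fills + (c,))
                for score, fills in beam
                for c in CANDIDATES[box]]
    # Each expansion is independent, so this step parallelises trivially.
    beam = sorted(expanded, reverse=True)[:BEAM_WIDTH]

best_score, best_fills = beam[0]
print(best_fills, best_score)
```

A false match must now be consistent with every other box's fill at once, which is why joint scoring is so much more discriminating than scoring boxes independently.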
This requires some programming; as far as I can see, it has not been implemented in the public domain. I would imagine that well-resourced intelligence organisations have had access to approaches like this for some years, and that black boxes will soon be seen as an ineffective approach to redaction.
https://www.schneier.com/blog/archives/2024/03/using-llms-to-unredact-text.html
https://arxiv.org/pdf/2206.02285