Apple researchers built the GSM-Symbolic benchmark by slightly altering the names, objects, or numbers in existing grade-school math problems. For instance, “Jimmy” became “John,” and “apples” became “oranges.” The premise was that if the models were genuinely reasoning, these superficial changes would not affect their performance.
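To make the idea concrete, here is a minimal sketch of how surface-level variants of one problem could be generated. It is not the researchers' actual pipeline: the template, the name and object pools, the number ranges, and the function names `make_variant` and `make_variants` are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical problem template. The placeholders are assumptions for this
# sketch, not taken from the benchmark itself.
TEMPLATE = (
    "{name} has {a} {item}. {name} buys {b} more {item}. "
    "How many {item} does {name} have now?"
)

# Illustrative pools of interchangeable surface details.
NAMES = ["Jimmy", "John", "Sofia"]
ITEMS = ["apples", "oranges", "pencils"]


def make_variant(rng):
    """Fill the template with a random name, object, and pair of numbers.

    Returns the question text and its correct answer. The underlying
    reasoning (one addition) is the same for every variant.
    """
    name = rng.choice(NAMES)
    item = rng.choice(ITEMS)
    a = rng.randint(2, 20)
    b = rng.randint(2, 20)
    question = TEMPLATE.format(name=name, item=item, a=a, b=b)
    return question, a + b


def make_variants(n, seed=0):
    """Generate n variants reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_variant(rng) for _ in range(n)]


if __name__ == "__main__":
    for question, answer in make_variants(3):
        print(question, "->", answer)
```

Because every variant requires the same single addition, a model that solves one but fails another is reacting to the surface details rather than to the structure of the problem.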