I wonder what the risks are of including deleted and pre-edit content in training data. Most edits are going to be typo and formatting fixes; do you really want 2-3 copies of the same message, typos included, in your training data? Similarly, deleted comments are mostly nonsense, unhelpful, duplicates, or highly controversial.
If someone wants to dig through and find individual users to restore, that’s one thing, but I don’t think I’d immediately choose to train on that other data unless I had to.
I don’t see it as hypocritical at all. Public comments are, for me at least, put out for the public good. It’s the same reason someone might release open source code under the MIT license. My issue with Reddit is that they restricted who can obtain the data and then privately sold it to only the highest bidder. It should be freely available to all who want to view it, without restrictions based on money or power.