Makes me wonder if anyone's tried to estimate the proportion of contributions whose withdrawal would make a significant difference to a given model's outputs. I'm pretty sure I'm not the first to think of doing this at a small scale (locally) and extrapolating from the results. Seems like a fun personal project to satisfy my curiosity, dunno if it has any real-world value😅
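For anyone wanting to prototype this locally, the simplest version is probably a leave-one-contributor-out ablation: retrain with one contributor's data removed and measure how far the outputs move. Here's a minimal sketch on synthetic data with scikit-learn -- the contributor assignments, the 0.05 "significance" threshold, and the logistic regression standing in for a real model are all illustrative assumptions, not anyone's published method:

```python
# Toy leave-one-contributor-out experiment: train twice, compare outputs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a scraped corpus: 2,000 examples, each "owned"
# by one of 100 contributors.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
owners = rng.integers(0, 100, size=len(y))
X_train, X_test, y_train, y_test, own_train, _ = train_test_split(
    X, y, owners, test_size=0.25, random_state=0
)

def fit_and_predict(mask):
    """Train on the masked-in rows; return class-1 probabilities on the test set."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[mask], y_train[mask])
    return model.predict_proba(X_test)[:, 1]

baseline = fit_and_predict(np.ones(len(y_train), dtype=bool))

# For each contributor, withdraw their rows and measure the mean absolute
# shift in test-set outputs relative to the full-data baseline.
shifts = np.array([
    np.mean(np.abs(fit_and_predict(own_train != owner) - baseline))
    for owner in range(100)
])

print(f"mean output shift per withdrawal: {shifts.mean():.4f}")
# The 0.05 cutoff for "significant" is an arbitrary choice for the demo.
print(f"share of withdrawals shifting outputs > 0.05: {(shifts > 0.05).mean():.1%}")
```

Extrapolating from a toy model like this to a frontier LLM is the shaky part (influence isn't scale-invariant), but it's exactly the kind of small experiment you could run in an afternoon.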
Absolutely -- I think this kind of experiment is going to be critical to upcoming content deals, and should ultimately inform the new set of rules about content/info/knowledge use in the "post-web-scale AI world".
One thing that makes me hopeful here is that these kinds of experiments are also interesting from a safety / explainability / interpretability lens -- so I think a lot of them will happen (and the estimation methods will get better with more technical work). One probably-not-so-great outcome would be all the major AI operators having really accurate estimates of data impact while creators and the public have none -- but this is where academic research incentives (publishing, etc.) can play a big role. And I do think public bodies have a role to play here in doing some "public valuation" as well (more on this to come...)