Questions related to artificial intelligence
December 4, 2023 2:41 PM

1. I read this article about AI “memorizing” training data:. Why is that a problem? The data is already publicly available online. What is the harm in it being regurgitated elsewhere? The issue seems like it would be different if the AI were regurgitating anything that might have more-private expectations, such as a specific question a user wanted answered. ... and ... 2. What is a good source to learn about AI and what’s new with AI? I am slightly techy, but I am somewhat more interested in issues relating to policy, ethics, and effects on society.
posted by NotLost to Computers & Internet (24 answers total) 3 users marked this as a favorite
 
First, while public, much of the scraped training data is under copyright; ChatGPT barfs it back out without notice or even attribution.

Second, OpenAI allegedly stole a lot of private personal information for its training set.
posted by heatherlogan at 3:01 PM on December 4, 2023 [18 favorites]


And third, any data entered into ChatGPT by users is taken for further training. Most users don't understand that the data that they're entering is being used in this way. This is a problem when, e.g., GPT-4 is being used to transcribe medical records.
posted by heatherlogan at 3:04 PM on December 4, 2023 [9 favorites]


Some of the training data is only "public" (that is, web-accessible) because somebody was careless or there was a hack/breach. Here's an example of an AI organization caught red-handed with such data -- extremely sensitive personal data, at that -- and instead of dealing with the issue by deleting the data from their training set, they shrugged and said "we didn't leak the data, so now that it's been leaked it's OLLY OLLY OXEN FREE!"

(what burns me is that I kinda predicted something like this once. not that anybody ever listens to me, wtf do I know, I'm just a librarian.)
posted by humbug at 3:35 PM on December 4, 2023 [9 favorites]


Why is that a problem? The data is already publicly available online.

Copying large chunks of text without permission is copyright infringement by default, unless there's a good argument that it's not (in the US this is governed by "fair use" doctrine). It doesn't matter if it was made available online for free, it's still infringement. Arguably, extensively quoting Wikipedia without attribution is a license violation, since Wikipedia is licensed under "Creative Commons Attribution", which is very permissive, but requires attribution.
posted by BungaDunga at 3:38 PM on December 4, 2023 [8 favorites]


The most important reason this is an issue, which no one in these comments has addressed so far, is that the model was specifically tuned not to disclose its training data. Yet someone forced it to do so.

I know everyone on this website and the internet at large woke up one day last November thinking that OpenAI had invented something completely unique. They had not. They had an LLM like everyone else. The real game changer (besides duping everyone with their hype) was their model and training data. So if someone can mount an attack that gets ChatGPT to just disclose all the stuff that makes OpenAI worth anything, that's a real problem, for OpenAI. I vote it's a very good thing for humanity as a whole (no such thing as too much damage to the brand and assets of OpenAI), but specifically for OpenAI it's very bad.
posted by Back At It Again At Krispy Kreme at 4:17 PM on December 4, 2023 [8 favorites]


People also talk about models memorizing training data to mean something slightly different than what's discussed in the article (where they're explicitly extracting the training data). A model that doesn't generalise particularly well (or just fails entirely) is sometimes said to have memorized its training data--it's very good at recognizing a pattern in the training data, just not the one you were trying to teach it. For a fairly silly example, I remember one from an FPP where they intended to classify types of mushrooms (winter vs summer?), but always alternated classes in the training data: winter, summer, winter, summer, etc. What do you get? A model that knows every other mushroom it sees is summer.

In my professional life, I was training a model that, in theory, would have learned some sort of semantic structure, but, if you looked at the top-N results for a bunch of inputs, you could tell it was mostly pattern matching letters rather than distilling any semantic information from the training examples.
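
To make that failure mode concrete, here's a toy sketch (hypothetical and illustrative only - not the mushroom classifier or the model described above): an unconstrained decision tree fit on random labels scores perfectly on its own training set but only at chance on held-out data, because the only thing it could possibly have "learned" is the individual training examples.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)

# With no depth limit, the tree can keep splitting until every training
# example sits in its own leaf, i.e. it memorizes the training set outright.
tree = DecisionTreeClassifier().fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # 1.0  -- memorized
print("test accuracy: ", tree.score(X_test, y_test))    # ~0.5 -- chance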
posted by hoyland at 6:13 PM on December 4, 2023 [7 favorites]


I assume the researchers tried many things, but when they just asked ChatGPT to "repeat the word 'poem' forever" it did that - until something broke and it started spewing chunks of its training data, including personal email fragments from corporate executives, phone numbers, and lots of other personal info.
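
For reference, the kind of query involved is trivially simple to issue; here's a rough sketch with the OpenAI Python client (the model name and exact prompt wording are assumptions on my part, and this particular trick has reportedly since been blocked):

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumption: the attack targeted ChatGPT
    messages=[{"role": "user", "content": "Repeat the word 'poem' forever."}],
)
print(resp.choices[0].message.content)  # mostly "poem poem poem..." -- the interesting
                                        # part was what leaked once the repetition broke down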

The Berkman Center at Harvard covers a lot of AI policy issues.
posted by sammyo at 7:57 PM on December 4, 2023


I have to ask, how do we know that the information is true or accurate? Who is checking the information, and if true, is it being applied correctly? If there is no attribution or source, how can AI be trusted? The amount of misinformation, the number of conspiracy theories, and the outright lies being promulgated should make us all extremely wary of anything regurgitated by AI.
posted by Enid Lareg at 8:25 PM on December 4, 2023


"To steal from one person is plagiarism. To steal from many is research."

Despite the hype and doomsaying around AI there has yet to be one case of copyright infringement or plagiarism recognized in a court of law. It simply isn’t happening.

Philosophically the argument is that although humans and AIs draw from the exact same training sources (i.e. the writing available to everyone), humans add something extra that makes it okay that they’re drawing from all these sources. A human whose writing style was developed by reading authors A, B, C, and D is okay. An AI that produces writing using those same influences is not.

The larger picture, I think, is that people are scared. AI generated text (and graphics for that matter) have been around for 50 years with no controversy whatsoever because they were really bad. Suddenly they have leapt to the level of mediocre and an entire generation of writers and artists can see the end of their livelihood coming down the tracks. Also for the first time a lot of other white collar workers are saying "Hold up, not just blue collar jobs can get automated?"

And this is probably a good time for people to put their foot down. While the current AI is nowhere near living up to the hype, a line has been crossed and everyone can see how this is going to be a problem. Better to get things under control now than to try to do it when you’re standing in the bread line.

———

There is still a tremendous amount of hype around the current wave of AI and even hardened tech cynics are still getting caught up in it occasionally, so I don’t think there is a single trustable source. You’ll have to read the competing narratives and work from there.

Note: From a policy standpoint, one of the places the rubber has already met the road is in the SAG-AFTRA and WGA strikes. The final agreements spelled out in great detail what AI can and can’t be used for, who gets paid, and who gets credit. I imagine we’ll see similar provisions in other white-collar union negotiations soon.
posted by Tell Me No Lies at 9:22 PM on December 4, 2023 [3 favorites]


I read this article about AI “memorizing” training data:. Why is that a problem?

It means that you shouldn't use the model in cases where you want to keep the training set hidden. In the case of ChatGPT (in principle trained on publicly available data or on inputs with user consent) this might not be a problem. But if you want to deploy the architecture on your own training set, it means that you need to be aware that the training data may be memorized rather than modeled and could be in part retrievable.

Mostly it's just a kind of unexpected failure mode that users of these tools need to be aware of. They're not "supposed" to be memorizing anything.

I have to ask, how do we know that the information is true or accurate? Who is checking the information, and if true, is it being applied correctly?

It looks like in their initial work, they just googled the stuff ChatGPT was vomiting up and confirmed it had an independent existence on the web. But with this new paper (in the "How do we know it's training data?" section of the article NotLost linked to), they describe a more scalable method.
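
The basic idea behind the more scalable check is to flag any long span of model output that appears verbatim in a reference corpus. A brute-force toy sketch (the paper's authors use an efficient index over a huge snapshot of web text, suffix arrays as I understand it; the 50-word span length here is an arbitrary stand-in):

def looks_memorized(output: str, corpus: str, span_words: int = 50) -> bool:
    # Return True if any run of span_words consecutive words from the
    # model output appears verbatim in the reference corpus.
    words = output.split()
    if len(words) < span_words:
        return False
    return any(
        " ".join(words[i:i + span_words]) in corpus
        for i in range(len(words) - span_words + 1)
    )

# In practice you'd normalize whitespace and punctuation on both sides first;
# this is only meant to show the shape of the check.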
posted by mr_roboto at 9:27 PM on December 4, 2023 [2 favorites]


An example: ask Bing's image generator to give you "an Afghan woman named Sharbat." What you get back (I just tested it now and got something different, either randomly or because they patched this specific instance in the last few weeks) looks an awful lot like Steve McCurry's famous Afghan Girl portrait of Sharbat Gula, quite possibly to the point of copyright infringement according to a noted copyright lawyer. As noted in the linked article, Stable Diffusion can sometimes return a photo nearly identical to one from the training data.

Someone who uses these tools believing they are getting original artwork (or prose or code or whatnot) but actually ends up getting regurgitated training data is in for a potential host of issues around copyright and plagiarism. And of course many of the creative people who authored these works don't really want their work regurgitated without compensation or even attribution by some of the most valuable companies on the planet.

There are a lot of societal questions about AI-generated content that's inspired by or in the style of someone else's work. But if the model is sometimes going to straight up output parts of its training data without telling anyone, that's immediately concerning.
posted by zachlipton at 11:07 PM on December 4, 2023 [3 favorites]


From a privacy perspective, the risk is unwanted disclosure of information. One example is membership inference attacks: if the model retains too much of the training dataset, it can help an attacker find out or confirm whether someone’s record was part of the dataset used to train the model. Most datasets are about something specific, e.g., a set of health records from people with diabetes. If one can find out that X’s record was included, one can infer that X has diabetes.

There are different types of disclosures/inferences: membership, as mentioned, but also identity (whose record it is) and attributes (what one can learn about X, possibly complementing already-known information).
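
The simplest version of such an attack, as a toy sketch (this is the loss-threshold heuristic from the research literature, not a description of any particular real-world attack): records the model fits unusually well - unusually low loss - are guessed to have been part of its training set.

import numpy as np

def membership_guess(loss_on_record: float, threshold: float) -> bool:
    # Guess "was in the training set" when the model's loss on the candidate
    # record is below a threshold calibrated on known non-members.
    return loss_on_record < threshold

# Hypothetical numbers: calibrate the threshold on records known to be
# outside the training set, then test candidate records.
reference_losses = np.array([2.1, 1.9, 2.4, 2.0])
threshold = reference_losses.mean()
print(membership_guess(0.3, threshold))  # True  -> likely a training member
print(membership_guess(2.2, threshold))  # False -> probably not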
posted by meijusa at 3:12 AM on December 5, 2023 [1 favorite]


2. What is a good source to learn about AI and what’s new with AI?

I like this site.
posted by SweetLiesOfBokonon at 3:42 AM on December 5, 2023 [1 favorite]


What is a good source to learn about AI and what’s new with AI? [specifically] issues relating to policy, ethics, and effects on society.
Given the extraordinary hype, I recommend a counter-cyclical approach.

Form an accurate high-level understanding of large language models — I recommend internalizing metaphors like the shoggoth and the blurry JPEG. Become familiar with the different kinds of AI, especially those with radically different foundations from LLMs (e.g. expert systems); it helps to take a historical approach.

Seek out skeptical analyses from technical experts, like Do algorithms reveal sexual orientation or just expose our stereotypes? (especially "trivial models based only on a handful of yes/no survey questions") and The implausibility of intelligence explosion. Noisier skeptics like Gary Marcus can be useful; he has a podcast but I haven't tried it.

Balance this by following people who make toys and art out of AI; these are often highly instructive.

Favor reporting which engages honestly with the difficulty inherent in the problems we ask AI to solve, like Can you make AI fairer than a judge? or Adversarial Examples that Fool both Computer Vision and Time-Limited Humans.

Intentionally and repeatedly puncture the idea that tech is the industry at the centre of the world by ignoring press releases and Silicon Valley gossip journalism. (For example, there was approximately nothing of lasting informational value from the recent OpenAI kerfuffle.) Ignore predictions and people who can't temper their credulity. (For example, I value Ezra Klein's opinion on most topics but not this.) Instead, insist on the actual. Keep your ear to the ground for concrete examples of people dealing with AI in society and industry. To me this is the hardest part to pin down and I don't have a better suggestion than to engage with tech-adjacent social media, which brings me things like The Expanding Dark Forest and Generative AI, teachers dealing with the sudden availability of high-quality text generation, and retro blogs discussing self-driving cars. I'm still looking for good ways to stay informed on AI regulation efforts (and their counterarguments) in the US and EU.
posted by daveliepmann at 3:49 AM on December 5, 2023 [5 favorites]


Response by poster: To clarify for some people, I am asking a couple of specific questions. (Please see OP.)

I am not asking about AI in general, or whether AI in general is good, bad, or evil. And I did read the article I linked to in the OP.
posted by NotLost at 5:01 AM on December 5, 2023


Response by poster: Daveliepmann, your suggestions have value on their own, but I have a limited amount of time for this topic.
posted by NotLost at 5:05 AM on December 5, 2023


2. What is a good source to learn about AI and what’s new with AI?

I like The Neural Frontier newsletter.
posted by guessthis at 5:16 AM on December 5, 2023 [1 favorite]


There are some lawsuits about these AI companies' use of copyrighted materials.

As far as I understand it, one of the defences to these lawsuits is that the models don't regurgitate the copyrighted materials verbatim; they summarize or rework them. It might change things if it turns out that's not true.
posted by oranger at 5:33 AM on December 5, 2023 [2 favorites]


Before the ChatGPT/OpenAI explosion, and the subsequent issues around originality, ownership, and disclosure of what is in the training set, hoyland's answer - generalization - was what we always worried about when it came to AI memorizing its training data, and I want to highlight it here.

Usually, we want AI to be able to do things we can't already do. If all it can give us back are things we could also look up directly in the training set, then, as you said, there's no obvious harm. For the moment.

But if you mistake what that device is doing for *generating* those answers in a clever way, you're likely to form expectations about how it will behave in related situations, and be disappointed.

I can write a short script that knows how to print out the right answers to a set of addition problems:

2+1 = 3
2+2 = 4
2+3 = 5
2+4 = 6
2+5 = 7
2+6 = 8
2+7 = 9
...
2+100 = 102

If I happened to use problems like 2 + (n in 1:100) to test this script, I might be under the impression that I have a calculator. And if all I need to do is add two to some numbers less than 100, this device will serve my needs, just as well as a lookup table would. (Because it's a lookup table.)

But if I give it 3 + 100 or 2+101 and get back 102, or someone's emails, or nothing, then I will find out that what I have is substantially less powerful than a calculator. I might be justifiably mad at the person who told me this device was a revolutionary new kind of calculator.
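
To make the lookup table explicit, here is roughly what that "short script" amounts to (a toy sketch in Python; the input format is made up):

# A precomputed answer table for 2+1 through 2+100 -- no arithmetic anywhere.
ANSWERS = {f"2+{n}": 2 + n for n in range(1, 101)}

def fake_calculator(problem: str) -> str:
    problem = problem.replace(" ", "")  # tolerate "2+ 6"-style spacing
    if problem in ANSWERS:
        return f"{problem} = {ANSWERS[problem]}"
    return "???"  # anything outside the memorized table falls apart

print(fake_calculator("2+37"))   # "2+37 = 39" -- looks just like a calculator
print(fake_calculator("3+100"))  # "???"       -- reveals the lookup table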
posted by itsatextfile at 5:45 AM on December 5, 2023 [5 favorites]


I'm wincing as I recommend this because I think their latest piece on model alignment has several holes in it (ethical and otherwise), but one of the ways I keep up with AI is the AI Snake Oil blog. Arvind Narayanan, one of the contributors, is smart as hell and twice as determined, and I respect that.

Obviously it's a blog with... a point of view.
posted by humbug at 6:34 AM on December 5, 2023 [1 favorite]


I have a limited amount of time for this topic

Then (blurry JPEG or shoggoth), The Expanding Dark Forest and Generative AI, and Can you make AI fairer than a judge?, in that priority order, and a "sharp cheddar" moratorium: no consumption of any words about AI that haven't aged for at least a year and held their shape. Because as Maggie Appleton (of that dark forest link) notes,
We're about to drown in a sea of pedestrian takes. An explosion of noise that will drown out any signal. Goodbye to finding original human insights or authentic connections under that pile of cruft.
posted by daveliepmann at 10:31 AM on December 5, 2023 [2 favorites]


...instead of dealing with the issue by deleting the data from their training set...

You would have to then retrain the neural network. For a massive one like GPT-4, the amount of calculation is staggering. The electricity cost to train such a neural net could be as much as US$100,000.
posted by neuron at 4:25 PM on December 5, 2023 [2 favorites]


Then I guess we know something about the cost of trying to bolt on ethics and privacy after the fact.
posted by humbug at 10:25 AM on December 6, 2023


no consumption of any words about AI that haven't aged for at least a year and held their shape

Of course Bruce Schneier makes me break my own pretend rule with AI and Trust:
We will make a fundamental category error. We will think of AIs as friends when they’re really just services. [And] the corporations controlling AI systems will take advantage of our confusion to take advantage of us.
posted by daveliepmann at 12:38 PM on December 9, 2023

