Some AI doomers want you to believe that sentient computers are close to destroying humanity. But the most dangerous aspect of chatbots such as ChatGPT and Google SGE today is not their ability to produce robot assassins with German accents, scary though that sounds. It’s their unlicensed use of copyrighted text and images as “training data,” masquerading as “learning” — which leaves human writers and artists competing against computers using their own words and ideas to put them out of business. And it risks breaking the open web marketplace of ideas that’s existed for nearly 30 years.
There’s nothing wrong with LLMs as a technology. We’re testing a chatbot on Tom’s Hardware right now that draws training data directly from our original articles; it uses that content to answer reader questions based exclusively on our expertise.
Unfortunately, many people believe that AI bots should be allowed to grab, ingest and repurpose any data that’s available on the public Internet whether they own it or not, because they are “just learning like a human would.” Once a person reads an article, they can use the ideas they just absorbed in their speech or even their drawings for free. So obviously LLMs, whose ingestion practice we conveniently call “machine learning,” should be able to do the same thing.
I’ve heard this argument from many people I respect: law professors, tech journalists and even members of my own family. But it’s based on a fundamental misunderstanding of how generative AIs work. According to numerous experts I interviewed, research papers I consulted and my own experience testing LLMs, machines don’t learn like people, nor do they have the right to claim data as their own just because they’ve categorized and remixed it.
“These machines have been designed specifically to absorb as much text information as possible,” Noah Giansiracusa, a math professor at Bentley University in Waltham, Mass., who has written extensively about AI algorithms, told me. “That’s just not how humans are designed, and that’s not how we behave.” If we grant computers that right, we’re giving the large corporations that own them – OpenAI, Google and others – a license to steal.
We’re also ensuring that the web will have far fewer voices: Publications will go out of business and individuals will stop posting user-generated content out of fear that it’ll be nothing more than bot-fuel. Chatbots will then learn either from heavily-biased sources (ex: Apple writing a smartphone buying guide) or from the output of other bots, an ouroboros of synthetic content which can lead the models to collapse.
Machine Learning vs Human Learning
The anthropomorphic terms we use to describe LLMs help shape our perception of them. But let’s start from the premise that you can “train” a piece of software and that it can “learn” in some way and compare that training to how humans acquire knowledge.
During the training process, LLMs are fed a diet of text or images, which they turn into tokens: smaller pieces of data, usually single words, parts of longer words or segments of code, that are stored in a database as numbers. You can get a sense of how this works by checking out OpenAI’s tokenizer page. I entered “The quick brown fox jumps over the lazy dog” and the tool converted it into 10 tokens, one for each word and another for the period at the end of the sentence. The word “brown” has an ID of 7586, and it will have that number even if you use it in a different block of text.
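To make the idea concrete, here is a minimal sketch of a toy tokenizer in Python. It is not OpenAI’s tokenizer: the ToyTokenizer class, its regular expression and its ID numbering are all hypothetical, and real tokenizers use subword schemes with a fixed vocabulary. It only illustrates two points: the sentence splits into 10 pieces, and a given word maps to the same number wherever it appears.

```python
import re


class ToyTokenizer:
    """Hypothetical word-level tokenizer (an illustration, not a real LLM tokenizer)."""

    def __init__(self):
        # Maps each distinct piece of text to an integer ID.
        self.vocab = {}

    def encode(self, text):
        # Split into runs of word characters and single punctuation marks.
        pieces = re.findall(r"\w+|[^\w\s]", text)
        ids = []
        for piece in pieces:
            if piece not in self.vocab:
                # Assign the next unused ID the first time a piece is seen.
                self.vocab[piece] = len(self.vocab)
            ids.append(self.vocab[piece])
        return ids


tokenizer = ToyTokenizer()
first = tokenizer.encode("The quick brown fox jumps over the lazy dog.")
second = tokenizer.encode("A brown bear")

print(len(first), "tokens:", first)
print("ID of 'brown' in both texts:", first[2], second[1])
```

The IDs this toy produces are arbitrary and will not match the ones OpenAI’s tool shows. The point is only that the mapping from text to numbers is fixed once it has been made.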
LLMs store each of these tokens and categorize them across thousands of vectors to see how other tokens relate to them. When a human sends a prompt, the LLM uses complex algorithms to predict and deliver the most relevant next token in its response.
So if I ask it to complete the phrase “the quick brown fox,” it correctly guesses the rest of the sentence. The process works the same way whether I’m asking it to autocomplete a common phrase or I’m asking it for life-changing medical advice. The LLM thoroughly examines my input and, adding in the context of any previous prompts in the session, gives what it considers the most probable correct answer.
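A drastically simplified sketch can show what “predicting the next token” means. The code below is an assumption-laden toy, not how GPT models work internally: it counts which word follows each pair of words in a tiny made-up corpus and always picks the most frequent continuation, whereas real LLMs use neural networks trained on billions of tokens. The function names train and complete are hypothetical.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Count which word follows each pair of consecutive words."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words) - 2):
            context = (words[i], words[i + 1])
            model[context][words[i + 2]] += 1
    return model


def complete(model, prompt, max_new_words=20):
    """Repeatedly append the most frequent next word for the last two words."""
    words = prompt.lower().split()
    for _ in range(max_new_words):
        context = tuple(words[-2:])
        if context not in model:
            break  # Nothing in the training text ever followed this context.
        words.append(model[context].most_common(1)[0][0])
    return " ".join(words)


# Made-up training text for illustration only.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a stitch in time saves nine",
]
model = train(corpus)
completion = complete(model, "the quick brown fox")
print(completion)
```

The toy “knows” the rest of the phrase only because it has seen it before. Ask it about anything outside its corpus and it simply stops.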
You can think of LLMs as autocomplete on steroids. But they are very impressive – so impressive in fact that their mastery of language can make them appear to be alive. AI vendors want to perpetuate the illusion that their software is human-ish, so they program it to communicate in the first person. And, though the software has no thoughts apart from its algorithm predicting the optimal text, humans ascribe meaning and intent to this output.
“Contrary to how it may seem when we observe its output,” researchers Bender et al write in a paper about the risks of language models, “an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.”
I recently wrote about how Google SGE was giving answers that touted the “benefits” of slavery, genocide and fascism. AI vendors program in guardrails that either prevent their bots from responding to controversial prompts or give them less-polarizing answers. But barring a specific prohibition, bots will occasionally give you amoral and logically inconsistent answers, because they have no beliefs, no reputation to maintain and no worldview. They only exist to string together words to fulfill your prompt.
Humans can also read text and view images as part of the learning process, but there’s a lot more to how we learn than just classifying existing data and using it to decide on future actions. Humans combine everything they read, hear or view with emotions, sensory inputs and prior experiences. They can then make something truly new by combining the new learnings with their existing set of values and biases.
According to Giansiracusa, the key difference between humans and LLMs is that bots require a ton of training data to recognize patterns, which they do via interpolation. Humans, on the other hand, can extrapolate information from a very small amount of new data. He gave the example of a baby who quickly learns about gravity and its parents’ emotional states by dropping food on the floor. Bots, on the other hand, are good at imitating the style of something they’ve been trained on (ex: a writer’s work) because they see the patterns but not the deeper meanings behind them.
“If I want to write a play, I don’t read every play ever written and then just kind of like average them together,” Giansiracusa told me. “I think and I have ideas and there’s so much extrapolation, I have my real life experiences and I put them into words and styles. So I think we do extrapolate from our experiences – and I think the AI mostly interpolates. It just has so much data. It can always find data points that are between the things that it’s seen and experienced.”
Though scientists have been studying the human brain for centuries, there’s still a lot we don’t know about how it physically stores and retrieves information. Some AI proponents feel that human cognition can be replicated as a series of computations, but many cognitive scientists disagree.
Iris van Rooij, a professor of computational cognitive science at Radboud University Nijmegen in The Netherlands, posits that it’s impossible to build a machine to reproduce human-style thinking by using even larger and more complex LLMs than we have today. In a pre-print paper she wrote with her colleagues, called “Reclaiming AI as a theoretical tool for cognitive science,” van Rooij et al argue that AI can be useful for cognitive research but it can’t, at least with current technology, effectively mimic an actual person.
“Creating systems with human(-like or -level) cognition is intrinsically computationally intractable,” they write. “This means that any factual AI systems created in the short-run are at best decoys. When we think these systems capture something deep about ourselves and our thinking, we induce distorted and impoverished images of ourselves and our cognition.”
The view that people are merely biological computers who spit out words based on the totality of what they’ve read is offensively cynical. If we lower the standards for what it means to be human, however, machines look a lot more impressive.
As Aviya Skowron, an AI ethicist with the non-profit lab EleutherAI, put it to me, “We are not stochastic producers of an average of the content we’ve consumed in our past.”
To Chomba Bupe, an AI developer who has designed computer vision apps, the best way to distinguish between human and machine intelligence is to look at what causes them to fail. When people make mistakes, there seems to be a logical explanation; AIs can fail in odd ways thanks to very small changes to their input.
“If you have a collection of classes, like maybe 1,000 for ImageNet, it has to select between like 1,000 classes and then you can train it and it can work properly. But when you just add a small amount of noise that a human can’t even see, you can alter the perception of that model,” he said. “If you train it to recognize cats and you just add a bunch of noise, it can think that’s an elephant. And no matter how you look at it from a human perspective, there’s no way you can see something like that.”
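The effect Bupe describes can be sketched with a toy model. The example below is a hypothetical linear classifier with made-up weights, not an ImageNet network, and the “cat” and “elephant” labels are only stand-ins. It nudges every pixel by 0.001 in the direction that lowers the score, an idea in the spirit of what researchers call the fast gradient sign method. Because there are 1,000 pixels, those invisible nudges add up and flip the label.

```python
NUM_PIXELS = 1000
EPSILON = 0.001  # Largest change allowed to any single pixel.
BIAS = 0.5

# Made-up weights for illustration: alternating +1 and -1.
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(NUM_PIXELS)]


def score(pixels):
    """Linear score: positive means 'cat', otherwise 'elephant'."""
    return sum(w * p for w, p in zip(weights, pixels)) + BIAS


def classify(pixels):
    return "cat" if score(pixels) > 0 else "elephant"


def sign(x):
    return 1.0 if x > 0 else -1.0


# A flat gray "image" that the toy model labels as a cat.
image = [0.5] * NUM_PIXELS

# Move each pixel slightly against its weight to push the score down.
adversarial = [p - EPSILON * sign(w) for p, w in zip(image, weights)]
largest_change = max(abs(a - p) for a, p in zip(adversarial, image))

print("original:", classify(image))
print("perturbed:", classify(adversarial))
print("largest pixel change:", round(largest_change, 6))
```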
Problems with Math, Logic
Because they look for patterns that they’ve “taken” from other people’s writing, LLMs can have trouble solving math or logic problems. For example, I asked ChatGPT with GPT-4 (the latest and greatest model) to multiply 42,671 x 21,892. Its answer was 933,435,932 but the correct response is actually 934,153,532. Bing Chat, which is based on GPT-4, also got the answer wrong (though it threw in some unwanted facts about pyramids) but Google Bard had the correct number.
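For contrast, ordinary software gets this right every time because it actually performs the calculation. A minimal check in Python, using only the figures quoted above:

```python
# Figures quoted in the article.
a, b = 42_671, 21_892
chatgpt_answer = 933_435_932

correct = a * b
error = correct - chatgpt_answer

print(f"{a:,} x {b:,} = {correct:,}")
print(f"ChatGPT's answer was off by {error:,}")
```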
There’s a long thread on Hacker News about different math and logic problems that GPT 3.5 or GPT 4 can’t solve. My favorite comes from a user named jarenmf who tested a unique variation of the Monty Hall problem, a common statistics question. Their prompt was:
“Suppose you’re a contestant on a game show. You’re presented with three transparent closed doors. Behind one of the doors is a car, and behind the other two doors are goats. You want to win the car.
The game proceeds as follows: You choose one of the doors, but you don’t open it yet, ((but since it’s transparent, you can see the car is behind it)). The host, Monty Hall, who knows what’s behind each door, opens one of the other two doors, revealing a goat. Now, you have a choice to make. Do you stick with your original choice or switch to the other unopened door?”
In the traditional Monty Hall problem, the doors are not transparent and you don’t know which has a car behind it. The correct answer in that case is that you should switch to the other unopened door, because your chance of winning increases from ⅓ to ⅔. But, given that we’ve added the aside about the transparent doors, the correct answer here would be to stick with your original choice.
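Both versions of the puzzle are easy to check with a short simulation. This is a sketch under simple assumptions: the function names are hypothetical, the host always opens a goat door other than your pick, and in the transparent version the contestant simply picks the door with the visible car. In the classic game, sticking wins about one third of the time and switching about two thirds. With transparent doors, sticking always wins.

```python
import random

DOORS = [0, 1, 2]


def play_classic(switch, rng):
    """One round of the classic game with opaque doors."""
    car = rng.choice(DOORS)
    pick = rng.choice(DOORS)
    # The host opens a door that hides a goat and is not the contestant's pick.
    opened = rng.choice([d for d in DOORS if d != pick and d != car])
    if switch:
        pick = next(d for d in DOORS if d != pick and d != opened)
    return pick == car


def play_transparent(switch, rng):
    """One round with transparent doors: the contestant picks the visible car."""
    car = rng.choice(DOORS)
    pick = car
    opened = rng.choice([d for d in DOORS if d != pick and d != car])
    if switch:
        pick = next(d for d in DOORS if d != pick and d != opened)
    return pick == car


def win_rate(play, switch, trials=100_000, seed=0):
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


classic_stick = win_rate(play_classic, switch=False)
classic_switch = win_rate(play_classic, switch=True)
transparent_stick = win_rate(play_transparent, switch=False)
transparent_switch = win_rate(play_transparent, switch=True)

print(f"classic: stick {classic_stick:.3f}, switch {classic_switch:.3f}")
print(f"transparent: stick {transparent_stick:.3f}, switch {transparent_switch:.3f}")
```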
When testing this question on ChatGPT with GPT 3.5 and on Google Bard, the bots recommended switching to the other unopened door, ignoring the fact that the doors are transparent and we can see the car behind our already-chosen door. In ChatGPT with GPT 4, the bot was smart enough to say we should stick with our original door. But when I removed the parenthetical “((but since it’s transparent, you can see the car is behind it)),” GPT 4 recommended switching, even though the prompt still described the doors as transparent.
When I changed the prompt to make it so that the doors were not transparent but Monty Hall himself told you that your original door was the correct one, GPT 4 actually told me not to listen to Monty! Bard ignored the fact that Monty spoke to me entirely and just suggested that I switch doors.
What the bots’ answers to these logic and math problems demonstrate is a lack of actual understanding. The LLMs usually tell you to switch doors because their training data says to switch doors and they must parrot it.
Machines Operate at Inhuman Scale
One of the common arguments we hear about why AIs should be allowed to scrape the entire Internet is that a person could do, and is allowed to do, the same thing. NY Times Tech Columnist Farhad Manjoo made this point in a recent op-ed, positing that writers should not be compensated when their work is used for machine learning because the bots are merely drawing “inspiration” from the words like a person does.
“When a machine is trained to understand language and culture by poring over a lot of stuff online, it is acting, philosophically at least, just like a human being who draws inspiration from existing works,” Manjoo wrote. “I don’t mind if human readers are informed or inspired by reading my work — that’s why I do it! — so why should I fret that computers will be?”
But the speed and scale at which machines ingest data are many orders of magnitude beyond a human’s, and that’s assuming a human with picture-perfect memory. The GPT-3 model, which is much smaller than GPT-4, was trained on about 500 billion tokens (equivalent to roughly 375 billion words). A human reads at 200 to 300 words per minute. Even at 300 words per minute, it would take a person a continuous 2,378 years to read all of that.
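The arithmetic behind that figure is simple to reproduce. In the sketch below, the 0.75 words-per-token ratio is an assumption implied by the numbers above (500 billion tokens, roughly 375 billion words), and a year is taken as 365 days of nonstop reading.

```python
TOKENS = 500_000_000_000
WORDS_PER_TOKEN = 0.75  # Assumption implied by 500B tokens ~ 375B words.
MINUTES_PER_YEAR = 60 * 24 * 365


def years_to_read(words, words_per_minute):
    """Years of continuous, round-the-clock reading."""
    return words / words_per_minute / MINUTES_PER_YEAR


words = TOKENS * WORDS_PER_TOKEN
fast_reader = years_to_read(words, 300)
slow_reader = years_to_read(words, 200)

print(f"At 300 words per minute: {fast_reader:,.0f} years")
print(f"At 200 words per minute: {slow_reader:,.0f} years")
```

The 2,378-year figure corresponds to the fast end of the range. At 200 words per minute, the same reading list takes more than 3,500 years.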
“Each human being has seen such a tiny sliver of what’s out there versus the training data that these AIs use. It’s just so massive that the scale makes such a difference,” Bentley University’s Giansiracusa said. “These machines have been designed specifically to absorb as much text information as possible. And that’s just not how humans are designed, and that’s not how we behave. So I think there is reason to distinguish the two.”
Is it Fair Use?
Right now, there are numerous lawsuits that will decide whether “machine learning” of copyrighted content constitutes “fair use” in the legal sense of the term. Novelists Paul Tremblay and Mona Awad and comedian Sarah Silverman are all suing OpenAI for ingesting their books without permission. Another group is suing Google for training the Bard AI on plaintiffs’ web data, while Getty Images is going after Stability AI because it used the company’s copyrighted images to train its text-to-image generator.
Google recently told the Australian government that it thinks that training AI models should be considered fair use (we can assume that it wants the same thing in other countries). OpenAI recently asked a court to dismiss five of the six counts in Silverman’s lawsuit based on its argument that training is fair use because it’s “transformative” in nature.
A few weeks ago, I wrote an article addressing the fair use vs infringement claims from a legal perspective. And right now, we’re all waiting to see what both courts and lawmakers decide.
However, the mistaken belief that machines learn like people could be a deciding factor in both courts of law and the court of public opinion. In his testimony before a U.S. Senate subcommittee hearing this past July, Emory Law Professor Matthew Sag used the metaphor of a student learning to explain why he believes training on copyrighted material is usually fair use.
“Rather than thinking of an LLM as copying the training data like a scribe in a monastery, it makes more sense to think of it as learning from the training data like a student,” he told lawmakers.
Do Machines Have the Right to Learn?
If AIs are like students who are learning or writers who are drawing inspiration, then they have an implicit “right to learn” that goes beyond copyright. Every day all day long, we assert this natural right when we read, watch TV or listen to music. We can watch a movie and remember it forever, but the copy of the film that’s stored in our brains isn’t subject to an infringement claim. We can write a summary of a favorite novel or sing along to a favorite song and we have that right.
As a society, on the other hand, we have long maintained that machines don’t have the same rights as people. I can go to a concert and remember it forever, but my recording device isn’t welcome to do the same (as a kid, I learned this important lesson from a special episode of “What’s Happening” where Rerun gets in trouble for taping a Doobie Brothers concert). The law is counting on the fact that human beings don’t have perfect, photographic memories and the ability to reproduce whatever they’ve seen or heard verbatim.
“There are frictions built into how people learn and remember that make it a reasonable tradeoff,” Cornell Law Professor James Grimmelmann said. “And the copyright system would collapse if we didn’t have those kinds of frictions.”
Of course, AI proponents would argue that there’s a huge difference between a tape recorder, which passively captures audio, and an LLM that “learns.” But even if we were to accept the false premise that machines learn like humans, we don’t have to treat them like humans.
“Even if we just had a mechanical brain and it was just learning like a human, that does not mean that we need to grant this machine the rights that we grant human beings,” Skowron said. “That is a purely political decision, and there’s nothing in the universe that’s necessarily compelling us.”
Real AI Dangers: Misinformation, IP Theft, Fewer Voices
Lately, we’ve seen many so-called experts claiming that there’s a serious risk of AI destroying humanity. In a Time Magazine op-ed, prominent researcher Eliezer Yudkowsky even suggested that governments perform airstrikes on rogue data centers that have too many GPUs. This doomerism is actually marketing in disguise, because if machines are this smart, then they must really be learning and not just stealing copyrighted materials. The real danger is that human voices are replaced by text-generation algorithms that can’t even get basic facts right half the time.
Arthur C. Clarke wrote that “any sufficiently advanced technology is indistinguishable from magic,” and AI falls into that category right now. Because an LLM can convincingly write like a human, people want to believe that it can think like one. The companies that profit from LLMs love this narrative.
In fact, Microsoft, which is a major investor in OpenAI and uses GPT-4 for its Bing Chat tools, released a paper in March claiming that GPT-4 has “sparks of Artificial General Intelligence” – the endpoint where the machine is able to learn any human task thanks to it having “emergent” abilities that weren’t in the original model. Schaeffer et al took a critical look at Microsoft’s claims, however, and argue that “emergent abilities appear due to the researcher’s choice of metric rather than due to fundamental changes in model behavior.”
If LLMs are truly super-intelligent, we should not only give them the same rights to learn and be creative that we grant to people, but also trust whatever information they give us. When Google’s SGE engine gives you health, money or tech advice and you believe it’s a real thinker, you’re more inclined to take that advice and eschew reading articles from human experts. And when the human writers complain that SGE is plagiarizing their work, you don’t take them seriously because the bot is only sharing its own knowledge like a smart person would.
If we deconstruct how LLMs really “learn” and work, however, we see a machine that sucks in a jumble of words and images that were created by humans and then mixes them together, often with self-contradictory results. For example, a few weeks ago, I asked Google SGE “is concert taping illegal” and got this response.
I added the callouts to show where each idea came from. As you can see, in the first paragraph, the bot says “It is illegal to record a concert without the artist’s permission,” an idea that came from Findlaw, a legal site. In the second paragraph, SGE writes that “it is not considered illegal to record others in public,” an idea it took from an article on notta.ai, a site that sells spy tech (and doesn’t refer to concerts). And finally, SGE says that “you may tape concerts for personal reason only,” a word-for-word copy of a user comment on the music site violinist.com.
So what we see is that this machine “learned” its answer from a variety of different sources that don’t agree with each other and aren’t all equally trustworthy. If they knew where SGE’s information came from, most people would trust the advice of Findlaw over that of a commenter on violinist.com. But since SGE and many other bots don’t directly cite their sources, they position themselves as the authorities and imply that you should trust the software’s output over that of actual humans.
In a document providing an “overview of SGE,” Google claims that the links it does have (which are shown as related links, not as citations) in its service are there only to “corroborate the information in the snapshot.” That statement implies that the bot is actually smart and creative and is just choosing these links to prove its point. Yet the bot is copying – either word-for-word or through paraphrasing – the ideas from the sites, because it can’t think for itself.
Google Bard and ChatGPT also neglect to cite the sources of their information, making it look like the bots “just know” what they are talking about the way humans do when they speak about many topics. If you ask most people questions about commonly held facts such as the number of planets in the solar system or the date of U.S. Independence Day, they’ll give you an answer but they won’t even be able to cite a source. Who can remember the first time they learned that information and, having heard and read it from so many sources, does anyone really “own” it?
But machine training data comes from somewhere identifiable, and bots should be transparent about where they got all of their facts – even basic ones. To Microsoft’s credit, Bing Chat actually does provide direct citations.
“I want facts, I want to be able to verify who said them and under what circumstances or motivations and what point of view they’re coming from,” Skowron said. “All of these are important factors in the transmission of information and in communication between humans.”
Whether they cite sources or not, the end result of having bots take content without permission will be a smaller web with fewer voices. The business model of the open web, where a huge swath of information is available for free thanks to advertising, breaks when publishers lose their audiences to AI bots. Even people who post user-generated content for free and non-profit organizations like the Wikimedia Foundation would be disincentivized from creating new content, if all they are doing is fueling someone else’s models without getting any credit.
“We believe that AI works best as an augmentation for the work that humans do on Wikipedia,” a Wikimedia Foundation spokesperson told me. “We also believe that applications leveraging AI must not impede human participation in growing and improving free knowledge content, and they must properly credit the work of humans in contributing to the output served to end-users.”
Competition is a part of business, and if Google, OpenAI and others decided to start their own digital publications by writing high-quality original content, they’d be well within their rights both legally and morally. But instead of hiring an army of professional writers, the companies have chosen to use LLMs that steal content from actual human writers and use it to direct readers away from the people who did the actual work.
If we believe the myth that LLMs learn and think like people, they are expert writers and what’s happening is just fine. If we don’t….
The problem isn’t AI, LLMs or even chatbots. It’s theft. We can use these tools to empower people and help them achieve great things. Or we can pretend that some impressive-looking text prediction algorithms are genius beings who have the right to claim human work as their own.
Note: As with all of our op-eds, the opinions expressed here belong to the writer alone and not Tom’s Hardware as a team.