Brett Trout
The New York Times (NYT) has sued Defendants Microsoft and various OpenAI companies for allegedly using its copyrighted material to train their Copilot and ChatGPT products, respectively. The lawsuit includes claims of copyright infringement (direct, vicarious, and contributory), removal of copyright management information, unfair competition by misappropriation, and trademark dilution.
In its complaint, NYT argues the Defendants violated its copyright by publicly displaying NYT material (“Works”) in the form of GPT output and in synthetic search results. NYT also claims that, in response to inquiries, ChatGPT’s API returned “hallucinations.” The alleged hallucinations are article titles and hyperlinks attributed to NYT that do not, in fact, exist. Another issue is that Defendants allegedly built AI training datasets containing millions of copies of Times Works, gathered by scraping Works from NYT websites and reproducing Works from third-party datasets. NYT argues that by storing, processing, and reproducing the training datasets containing millions of copies of Works to train the GPT models on Microsoft’s supercomputing platform, and by disseminating generative output containing copies and derivatives of Works through the ChatGPT and Bing Chat offerings, Defendants have infringed NYT’s copyrights.

In response to this lawsuit, Defendants have requested to see copies of any documents NYT used to create the Works. The reasoning is that, under the law, NYT may only claim copyright over those portions of the Works that are original to the author and owned by NYT. More precisely, NYT cannot claim ownership over any portions of the Works that were AI-generated, copied from other authors, and/or in the public domain. While NYT may claim copyright in derivative works created from pre-existing material, that protection extends only to NYT’s contributions that are non-trivial and original.
Defendants have a point. NYT’s copyright registrations could be invalidated if, hypothetically: 1) some of the Works include non-copyrightable material; 2) that information was knowingly withheld from the Copyright Office; and 3) the Copyright Office would not have granted the registration if the correct information had been provided (17 U.S. Code § 411). In reality, even assuming some hypothetical non-copyrightable material was included and not revealed to the Copyright Office, proving that NYT knowingly withheld such information from the Copyright Office would be very difficult. On a more practical note, even if NYT did not knowingly withhold such information, any revelation that NYT secretly uses third-party or AI-generated material in its news stories could be critically damaging to NYT’s reputation. In such a hypothetical, NYT might be inclined to drop the lawsuit rather than disclose this information publicly. So while the discovery request may not have many teeth from a technical standpoint, it may have a large impact from a practical standpoint.
Not surprisingly, NYT has objected to the Defendants’ discovery requests, arguing that they are overbroad and vague, and that the requested material is protected by the reporter’s privilege under the First Amendment. Defendants responded that the requests are narrowly tailored and do not request source identities, that the information is not reasonably obtainable from other sources, and that the information is critical in determining whether NYT is asserting copyright over material that is non-expressive, non-human authored, created by third parties, and/or in the public domain.
It is unclear at this point whether the court will compel NYT to turn over the requested documents. What is clear is that if you are going to sue someone for using your copyrighted work to train its AI, it is critically important that the copyrighted work you are asserting is actually yours and does not contain any non-expressive, non-human authored, third-party, and/or public domain material. This is especially important to identify when filing for copyright registration of your works. When filing a copyright registration on any work that contains pre-existing material, it is critical to create a chart and record of each pre-existing and AI-generated element that went into the creation, along with a file containing copies, and/or the names, of each pre-existing and AI-generated work (taking care not to infringe any third-party copyrights in the process). For large operations generating hundreds or thousands of copyrighted works a day, this additional documentation could be quite onerous. In addition to protecting your copyright registrations from invalidity, this extra record-keeping fosters an environment that incentivizes authors to avoid improperly using third-party material. More importantly, this type of detailed documentation may encourage the infringer to acquiesce to a quicker settlement on better terms.
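For operations that want to automate the chart described above, the record can be kept as simple structured data. The following Python sketch is only an illustration under stated assumptions: the `Element` record, the `Origin` categories, and the helper names are hypothetical, not a Copyright Office format or a legal standard, and the categories should be adapted with counsel to match what your registration actually needs to disclose.

```python
import csv
import io
from dataclasses import dataclass, fields
from enum import Enum


class Origin(Enum):
    """Hypothetical provenance categories, chosen for illustration only."""
    ORIGINAL = "original"            # written by the claimant's own authors
    THIRD_PARTY = "third_party"      # quoted, licensed, or copied from others
    PUBLIC_DOMAIN = "public_domain"
    AI_GENERATED = "ai_generated"


@dataclass
class Element:
    """One row of the chart: a single element that went into a work."""
    work_id: str       # internal identifier for the work being registered
    description: str   # what the element is (e.g. "pull quote, paragraph 4")
    origin: Origin
    source: str = ""   # name or citation of the pre-existing or AI source


def elements_to_disclose(elements):
    """Return every element that is not original to the claimant."""
    return [e for e in elements if e.origin is not Origin.ORIGINAL]


def to_csv(elements):
    """Render the chart as CSV text so it can be filed with the work's records."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([f.name for f in fields(Element)])
    for e in elements:
        writer.writerow([e.work_id, e.description, e.origin.value, e.source])
    return buf.getvalue()


# Example chart for a single, made-up article.
chart = [
    Element("article-001", "body text", Origin.ORIGINAL),
    Element("article-001", "block quote, paragraph 4", Origin.THIRD_PARTY,
            "hypothetical wire service report"),
    Element("article-001", "summary sidebar", Origin.AI_GENERATED,
            "hypothetical in-house drafting tool"),
]

for element in elements_to_disclose(chart):
    print(element.description, "|", element.origin.value, "|", element.source)
```

Keeping one row per element, rather than one row per work, is a deliberate choice in this sketch: it lets you filter out exactly the material that is not original to you when preparing a registration, and it gives you a contemporaneous record to produce if the scope of your copyright is later challenged.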