Ethics of NLP Models: Understanding OpenAI's Data Practices for the Legal Profession

I. Introduction

Welcome to another edition of The Art of Law and Technology, where we strive to bridge the gap between the complexities of the law and the ever-evolving world of technology. Today, we're diving deep into the current panic surrounding OpenAI data practices and why lawyers should not fear these advanced tools. Many legal professionals are concerned about the potential waiver of privilege and breach of confidentiality, but often, this stems from a lack of understanding of the underlying technology and its implications. In this article, we'll demystify the inner workings of Natural Language Processing models, explore the nature of data in OpenAI databases, and compare OpenAI's data collection and use with popular platforms like Practical Law, Lexis Practice Advisor, Clio Practice Management, and Google Searches. So, buckle up and join us on this journey as we shed light on the fascinating intersection of law and technology and dispel the myths surrounding OpenAI and its data practices.

II. How Natural Language Processing (NLP) Models Work

A. Definition and Purpose of NLP

Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language in a way that is both meaningful and useful. The primary goal of NLP is to bridge the gap between human communication and computer understanding, which can significantly enhance the efficiency and effectiveness of various tasks in different fields, including the legal profession.

In the legal industry, NLP has numerous applications that can help streamline processes, enhance research, and improve the overall quality of legal services. Some common uses of NLP in the legal field include:

  1. Document analysis and review: NLP can analyze vast amounts of text, identify relevant information, and extract critical insights, making the document review process more efficient and less time-consuming.
  2. Legal research: NLP-powered tools can assist in searching for relevant case law, statutes, and regulations by understanding the context and intent of search queries, which can significantly reduce research time.
  3. Contract analysis: NLP can help identify potential risks, inconsistencies, and areas of improvement in contracts by analyzing the language used and comparing it to predefined templates or industry standards.
  4. Chatbots and virtual assistants: NLP enables the creation of chatbots and virtual assistants that can understand and respond to legal inquiries, providing quick and reliable information to clients and legal professionals.

C. How Language Models Work

Language models are designed to understand and process how humans use language. They use a few key steps to make sense of the words and sentences we use daily. These steps include:

  1. Breaking down text: The model separates a text into individual words or smaller parts to make it easier to analyze.
  2. Identifying word types: The model determines if each word is a noun, verb, adjective, or another part of speech.
  3. Analyzing sentence structure: The model looks at how words and phrases relate to each other in a sentence to make sense of their meaning.
  4. Finding important information: The model can spot and categorize essential details like names, places, and organizations in the text.
  5. Understanding emotions: The model can determine if the text expresses a positive, negative, or neutral feeling.
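As a rough illustration of the first step in the list above, the sketch below splits a sentence into lowercase word tokens and gives each unique word a number. It is a simplification under stated assumptions: large models such as GPT-3 typically work with sub-word pieces rather than whole words, and the function names (`tokenize`, `build_vocab`, `encode`) are hypothetical, chosen only for this example.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(tokens):
    """Give each unique token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn text into a list of IDs; words outside the vocabulary become -1."""
    return [vocab.get(tok, -1) for tok in tokenize(text)]

sentence = "The client signed the contract."
tokens = tokenize(sentence)
vocab = build_vocab(tokens)
ids = encode(sentence, vocab)

print(tokens)
print(ids)
```

Once text is in this numeric form, the later steps (word types, structure, names, sentiment) operate on the numbers rather than on the letters themselves.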

D. Teaching Language Models with Data

Language models learn how to understand human language by being shown lots of examples. These examples come in extensive text collections in different languages and on various topics. The more varied and extensive the examples, the better the model becomes at processing language.

Teaching a language model usually involves a method called supervised learning. In this method, the model is shown examples of input (like a sentence) and the expected output (such as the emotion the sentence conveys). The model then adjusts itself to get better at predicting the correct output based on the input. The more examples it sees, the better it becomes at making accurate predictions.
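To make the idea of "adjusting itself" concrete, here is a minimal sketch of supervised learning using a perceptron, one of the simplest learning rules. It is not how OpenAI trains its models; the four training sentences, their labels (1 for positive, 0 for negative), and all function names are made up for illustration.

```python
def features(sentence, vocab):
    """Count how often each vocabulary word appears in the sentence."""
    words = sentence.lower().split()
    return [words.count(w) for w in vocab]

def score(x, weights, bias):
    """Weighted sum of the word counts plus a bias term."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train(examples, vocab, epochs=20):
    """Perceptron rule: nudge the weights whenever a prediction is wrong."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        for sentence, label in examples:
            x = features(sentence, vocab)
            prediction = 1 if score(x, weights, bias) > 0 else 0
            error = label - prediction  # 0 when the prediction was correct
            weights = [w + error * xi for w, xi in zip(weights, x)]
            bias += error
    return weights, bias

def predict(sentence, vocab, weights, bias):
    """Return 1 (positive) or 0 (negative) for a sentence."""
    return 1 if score(features(sentence, vocab), weights, bias) > 0 else 0

# Hypothetical labeled examples: (input sentence, expected output)
examples = [
    ("great helpful result", 1),
    ("helpful clear answer", 1),
    ("terrible confusing result", 0),
    ("confusing poor answer", 0),
]
vocab = sorted({w for s, _ in examples for w in s.split()})
weights, bias = train(examples, vocab)

print(predict("helpful answer", vocab, weights, bias))
print(predict("confusing result", vocab, weights, bias))
```

After training, what remains is a list of numbers (`weights` and `bias`), not a copy of the example sentences.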

Before we turn to how OpenAI handles data collection and storage and how it keeps data secure, the rest of this section covers the different ways models are taught and what a stored model contains.

E. Different Ways to Teach Language Models

There are three main teaching methods: supervised learning, unsupervised learning, and transfer learning.

  1. Supervised Learning: As explained earlier, supervised learning involves showing the model examples of input and output pairs. The model learns by making predictions based on the input and comparing it to the correct output. Then, it adjusts to minimize the difference between its predictions and the actual output. This process continues until the model makes accurate predictions on new data.
  2. Unsupervised Learning: With unsupervised learning, the model isn't given labeled output data. Instead, it learns by finding patterns and relationships within the input data. This can involve grouping similar words or documents or identifying the main topics in a collection of documents.
  3. Transfer Learning: This method starts by teaching a model using a large dataset, then fine-tuning it with a smaller, more specific dataset. The idea is that the model can learn general language patterns from the first dataset and then adapt to the particular context of the second dataset. This approach has become popular with large-scale models like OpenAI's GPT-3.

F. What is Inside a Language Model?

A language model's data mainly comprises number-based versions of words, phrases, or sentences. These number-based versions are called embeddings or vectors. The model learns to connect these numbers to their meanings, which helps it understand and use human language.

Generally, a language model has these main parts:

  1. Vocabulary: A list of unique words the model understands. These words are usually changed into a consistent format, like making them all lowercase and removing punctuation.
  2. Embeddings: The number-based versions of words that capture their meanings. These numbers help the model understand how terms relate to each other. For example, similar words will have similar number patterns.
  3. Neural Network: A connected series of layers that take in the number-based words and create predictions based on them. The neural network learns to connect the number-based words to the correct output by adjusting its inner parts during training.
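The claim that similar words have similar number patterns can be sketched with cosine similarity, a common way to compare two vectors. The three-number embeddings below are hand-picked assumptions for illustration only; real models learn vectors with hundreds or thousands of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration.
embeddings = {
    "attorney": [0.9, 0.8, 0.1],
    "lawyer": [0.85, 0.75, 0.15],
    "banana": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return a value near 1.0 when two vectors point in a similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

close = cosine_similarity(embeddings["attorney"], embeddings["lawyer"])
far = cosine_similarity(embeddings["attorney"], embeddings["banana"])

print(round(close, 3), round(far, 3))
```

With these made-up vectors, "attorney" and "lawyer" score much closer to 1.0 than "attorney" and "banana", which is the sense in which the numbers capture how terms relate to each other.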

G. Language Model Data When Stored

When a language model is stored, the main parts saved include the vocabulary, the embeddings, and the neural network's inner functions. This data is usually in an efficient number format and can't be directly read as text or whole sentences.

To help explain this, imagine the model as a black box filled with a jumble of numbers, words, and strings. You shake the box over and over, and each time, you drop in a message and the box spits a message out. You add more numbers, words, and strings every time the box spits out a bad message and shake it again. When you start getting good messages, you stop shaking it, or you start shaking it with a bit more finesse. Inside of the box is chaos with a randomly generated structure, and peeking inside won't tell you much, no matter what you put into the box.

In short, language model data, when stored, doesn't look like tables of words with numbers or complete sentences with extra information. Instead, it's a complicated number-based representation of language saved as embeddings and neural network parts. As a result, this data can't be read directly as text, and someone looking through the stored numbers could not simply read out meaningful information or the original text from them.
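A small sketch of what "an efficient number format" can look like on disk: the five weights below are hypothetical stand-ins for a model's parameters, written out as raw bytes and read back using Python's standard `array` module. It illustrates the storage format only and makes no claim about any particular vendor's file layout.

```python
from array import array

# Hypothetical weights standing in for a trained model's parameters.
weights = array("f", [0.12, -0.87, 0.45, 0.03, -0.66])

# Serialize to raw bytes, roughly as a file on disk would hold them.
stored = weights.tobytes()

# Load the bytes back into numbers.
restored = array("f")
restored.frombytes(stored)

print(len(stored), "bytes")
print(stored.hex())
print([round(w, 2) for w in restored])
```

The hexadecimal dump is what a person opening the file would see: packed numbers, with no words or sentences to read.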

III. OpenAI's Data Collection and Storage Practices

A. Types of Data Collected by OpenAI

OpenAI collects various types of data to train and improve its NLP models (such as GPT-3 and ChatGPT). The data primarily consists of text from multiple sources, including books, articles, websites, and other publicly available content. This diverse dataset enables the model to learn language patterns, structures, and relationships across different contexts and domains.

In addition to the text data used for training, OpenAI also *used* to collect usage data when users interacted with its models through the API. However, the company recently updated its policies to exclude data submitted through the API from training collection.

B. How Data is Stored and Processed in OpenAI Databases

OpenAI stores the collected data in secure databases. As described in the previous sections, the text data used for training the models is preprocessed, tokenized, and transformed into numerical representations.

To maintain the security and integrity of its databases, OpenAI implements various measures such as encryption, access controls, and regular audits. These measures are meant to protect the data from unauthorized access, disclosure, or modification.

C. Security Measures in Place to Protect Data

OpenAI states that it takes data security seriously and has implemented several safeguards to protect the data it collects and stores. Some security measures include:

  1. Encryption: Data is encrypted in transit and at rest, protecting it from unauthorized access or interception.
  2. Access Controls: OpenAI employs strict access controls to limit access to its databases and systems. Only authorized personnel with a legitimate need to access the data can do so, and their activities are logged and monitored for potential security risks.
  3. Regular Audits and Monitoring: OpenAI conducts regular audits and monitoring of its systems, processes, and data handling practices to identify and address potential security vulnerabilities or areas for improvement.
  4. Data Retention and Deletion: OpenAI maintains a data retention policy that outlines the duration for which data is stored and the conditions under which it is deleted. This policy helps ensure that data is not retained longer than necessary and reduces the risk of unauthorized access or disclosure.

By implementing these security measures, OpenAI aims to protect the data it collects and stores, minimize the risk of unauthorized access or disclosure and maintain the trust of its users, including legal professionals who rely on its models for various tasks.

Despite the protections OpenAI implements (and has long implemented), it only recently clarified that API use is protected from being added to training data. This is important to note, as we will see that OpenAI's claimed protections of your data are, generally speaking, about as good as it gets among software services.

IV. Internet Services and Data Handling

A. Overview of How Internet Services Generally Work

Internet services are applications or platforms that provide various functionalities to users through the internet. These services can range from search engines and social media platforms to cloud-based software and communication tools. At their core, internet services are built upon a client-server architecture, where users (clients) interact with the service (server) by sending requests and receiving responses.

When users interact with an internet service, they typically send data (e.g., search queries, file uploads, or form inputs) to the server, which processes the data and returns the appropriate output or response. The data transmitted between the client and server is often encrypted to protect its confidentiality and integrity.

B. Common Data Storage Practices for Internet Services

Internet services store data in databases or storage systems, which can be on-premises or in the cloud. The choice of a data storage system depends on factors such as the size of the data, the required performance, and the specific needs of the service.

Some standard data storage practices for internet services include:

  1. Data Encryption: Encrypting data both in transit (when transmitted between the client and server) and at rest (when stored in the database) ensures that the data remains confidential and secure.
  2. Access Controls: Implementing strict access controls helps limit access to the stored data, ensuring that only authorized personnel can access it.
  3. Regular Backups: Regular backups of the stored data help protect against data loss or corruption.
  4. Data Retention Policies: Establishing data retention policies helps ensure that data is not stored longer than necessary, reducing the risk of unauthorized access or disclosure.

As with any technology, both risks and benefits are associated with using internet services in the legal field. Some of the principal risks and benefits include:

1. Risks

  1. Data Security: Storing sensitive client information and legal documents on external servers may expose them to potential security breaches or unauthorized access.
  2. Privacy and Confidentiality: Transmitting confidential client data over the internet may raise concerns about protecting confidentiality and attorney-client privilege.
  3. Reliability: Internet services may experience downtime or performance issues, which can impact the availability and efficiency of legal work.

2. Benefits

  1. Efficiency and Productivity: Internet services can streamline various legal tasks, such as document review, research, and contract analysis, increasing efficiency and productivity.
  2. Collaboration: Cloud-based services allow legal professionals to work together on documents and projects more effectively.
  3. Accessibility: Internet services can be accessed from anywhere with an internet connection, providing flexibility and convenience for legal professionals.

By understanding the risks and benefits of using internet services in the legal field, legal professionals can make informed decisions about adopting these technologies and implementing appropriate safeguards to protect their clients' interests and maintain the confidentiality of sensitive information.

The benefits of using software tools in your practice are apparent, and as you have probably guessed by now, I'm a big proponent of their use. This does not mean that software works the way you think it does. Companies claim a variety of protections when users supply data to their services. The best we can do is believe them when they tell us the steps they take to protect our data, but there is clearly a misunderstanding about what encryption of data is. Unless the service provides end-to-end encryption, the company can and probably does see your data in certain circumstances.

In the next section, we will compare OpenAI's data collection and use practices with other popular platforms, such as Practical Law, Lexis Practice Advisor, Clio Practice Management, and Google Searches. This comparison will help illustrate how OpenAI's practices align with industry standards and provide context for evaluating the potential risks and benefits associated with using its models in the legal field.

V. Comparing OpenAI Data Practices with Other Platforms

This section walks through the data collection and use practices of Westlaw, LexisNexis, Clio Practice Management, and Google Search, then compares them with OpenAI's to show how OpenAI's practices align with industry standards.

A. Westlaw

Westlaw, a product of Thomson Reuters, provides legal know-how, including practice notes, standard documents, checklists, and legal updates. At a high level, data collection and use practices for Westlaw include:

1. Data Collection: Westlaw collects user information such as search queries, viewed documents, and account details. This data is used to improve the platform's functionality, provide personalized content, and monitor usage patterns.

2. Data Storage and Security: Westlaw stores user data on secure servers and implements security measures such as encryption, access controls, and regular audits to protect the data.

3. Privacy and Confidentiality: Westlaw maintains a privacy policy that outlines its data handling practices and ensures compliance with data protection regulations.

B. LexisNexis

LexisNexis is a legal research platform providing practical guidance, annotated forms, and legal analysis. Its data collection and use practices include:

1. Data Collection: LexisNexis collects user data, such as search queries, document views, and account details, to enhance the platform's functionality, provide personalized content, and track usage patterns.

2. Data Storage and Security: User data is stored on secure servers, and LexisNexis implements security measures like encryption, access controls, and regular audits to protect the data.

3. Privacy and Confidentiality: LexisNexis maintains a privacy policy that outlines its data handling practices and ensures compliance with data protection regulations.

C. Clio Practice Management

Clio is a cloud-based practice management software for law firms, offering features like document management, time tracking, billing, and client communication. Clio's data collection and use practices include:

1. Data Collection: Clio collects user data, such as client information, documents, and billing details, to provide services and improve the platform's functionality.

2. Data Storage and Security: Clio stores user data on secure servers and employs security measures such as encryption, access controls, and regular audits to protect the data.

3. Privacy and Confidentiality: Clio maintains a privacy policy that outlines its data handling practices and ensures compliance with data protection regulations.

D. Google Searches

Google Search is a widely used search engine that legal professionals often rely on for research. Google's data collection and use practices include:

1. Data Collection: Google collects search queries, IP addresses, device information, and other usage data to improve its search algorithms, provide personalized results, and monitor usage patterns.

2. Data Storage and Security: Google stores user data on secure servers and implements security measures such as encryption, access controls, and regular audits to protect the data.

3. Privacy and Confidentiality: Google maintains a privacy policy that outlines its data handling practices and ensures compliance with data protection regulations.

E. Analysis of Similarities and Differences in Data Practices

One is hard-pressed to identify what is genuinely different between the high-level overviews of the various data collection practices of the above-listed companies. I have linked to the privacy policies of each of the companies, so you can scrutinize what the companies claim to do with your data, and I encourage you to do so to understand how data collection works.

It’s easy to look at the word “encrypted” and assume that no one can see your data. Generally speaking, the only circumstance in which a company has zero access to your data is when end-to-end encryption is implemented. End-to-end encrypted services are the ones that tell you that if you lose your password or something goes wrong, the company has no way to help you. With other services, tech support is often possible precisely because the people on the other side of the network can see your raw data. Encryption means that unauthorized access is less likely to yield results for a bad actor. End-to-end encryption means that only the data's creator and its intended recipients, who hold the appropriate password-like key, can access and see the data.
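The difference comes down to who holds the key. The sketch below uses a toy cipher built from Python's standard `hashlib` purely to illustrate that point; it is an assumption-laden demonstration, not a secure scheme and not any vendor's actual implementation. The names `keystream_xor`, `provider_key`, and `user_key` are hypothetical.

```python
import hashlib
import secrets

def keystream_xor(key, data):
    """Toy cipher for illustration only; do not use it to protect real data."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    # XOR is its own inverse, so the same call encrypts and decrypts.
    return bytes(b ^ k for b, k in zip(data, stream))

message = b"Draft settlement terms for client"

# Case 1: ordinary encryption at rest. The provider creates and keeps the key.
provider_key = secrets.token_bytes(32)
stored_at_rest = keystream_xor(provider_key, message)
provider_view = keystream_xor(provider_key, stored_at_rest)

# Case 2: end-to-end encryption. Only the user's devices hold the key.
user_key = secrets.token_bytes(32)
stored_e2e = keystream_xor(user_key, message)
provider_attempt = keystream_xor(provider_key, stored_e2e)
user_view = keystream_xor(user_key, stored_e2e)

print(provider_view == message)     # provider can read data it holds the key for
print(provider_attempt == message)  # provider's key does not open the E2E record
print(user_view == message)         # the key holder can always read it
```

In both cases the stored bytes are "encrypted," yet only in the second case is the provider unable to read them.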

Here's how Apple describes their two levels of encryption:

iCloud data security and encryption

The security of your data in iCloud starts with the security of your Apple ID. All new Apple IDs require two-factor authentication to help protect you from fraudulent attempts to gain access to your account. Two-factor authentication is also required for many features across Apple’s ecosystem, including end-to-end encryption.

Apple offers two options to encrypt and protect the data you store in iCloud:

  • Standard data protection is the default setting for your account. Your iCloud data is encrypted, the encryption keys are secured in Apple data centers so we can help you with data recovery, and only certain data is end-to-end encrypted.
  • Advanced Data Protection for iCloud is an optional setting that offers our highest level of cloud data security. If you choose to enable Advanced Data Protection, your trusted devices retain sole access to the encryption keys for the majority of your iCloud data, thereby protecting it using end-to-end encryption. Additional data protected includes iCloud Backup, Photos, Notes, and more.

Now, compare what you enter into OpenAI to these other tools — which are you more likely to have entered confidential or privileged information into, even accidentally? The answer is not OpenAI, or at least, OpenAI is not the only tool where this occurs.

The laws, rules, and precedents that speak to whether using a given tool is a breach of confidentiality or waiver of privilege rely on privacy policies and other similar contracts. And one can see that OpenAI's approach to privacy and data protection is not set apart from the others. In other words, if using OpenAI is a breach of confidentiality or waiver of privilege, then using the other tools must be as well.

The use of tools like Westlaw, Clio, and LexisNexis is so ubiquitous as to be beyond a question of ethical compliance, but that is not justified by the stated privacy practices of any of the listed companies.

VI. Addressing Legal and Ethical Concerns

As legal professionals consider adopting OpenAI's models for various tasks, it's crucial to address the legal and ethical concerns surrounding their usage. In this section, we will discuss attorney-client privilege and confidentiality, how OpenAI data practices relate to these concerns, and recommendations for mitigating potential risks.

A. Overview of Attorney-Client Privilege and Confidentiality

Attorney-client privilege is a legal principle that protects communications between an attorney and their client from being disclosed to third parties. This privilege encourages open and honest communication between attorneys and clients, facilitating the provision of effective legal advice and representation.

Confidentiality, on the other hand, is a broader ethical obligation that requires attorneys to protect sensitive information related to their clients and legal matters. This obligation extends beyond the scope of attorney-client privilege and includes information obtained from other sources, such as opposing parties, witnesses, or third-party service providers.

B. How OpenAI Data Practices Relate to These Concerns

OpenAI's data practices, as discussed earlier, include the collection of text data for training its models and usage data from user interactions. While the text data used for training is sourced from publicly available content, the usage data is collected through input queries and generated outputs.

Usage data for the improvement of a service is a common policy. In fact, out of the hundreds of privacy policies I have reviewed and drafted, I have never seen one that does not collect this type of data. The best privacy policy I have ever seen was for DuckDuckGo, and even their data collection practices are analogous to entering data into OpenAI services. While DuckDuckGo will not be able to identify who entered a particular search term, if you entered a client's name into the search bar, the client's name would be associated with whatever terms you added to the search — an equal breach of confidentiality, in my estimation.

C. Recommendations for Mitigating Potential Risks

You can adopt several strategies to mitigate the risks associated with using software services (including OpenAI's models) while ensuring compliance with attorney-client privilege and confidentiality obligations:

  1. Limiting Sensitive Information: When interacting with software services, avoid including specific client names, case numbers, or other identifiable information in the input queries. This can help minimize the risk of exposing sensitive information.
  2. Monitoring Usage: Regularly review your usage of software services to ensure that you are not inadvertently disclosing confidential information or violating attorney-client privilege. If you identify any potential breaches, take appropriate steps to address them and inform the necessary parties.
  3. Training and Awareness: Ensure that all members of your legal team know the ethical obligations related to attorney-client privilege and confidentiality, and provide training on best practices for using software services while maintaining compliance with these obligations.
  4. Reviewing Privacy Policies and Updates: Stay informed about your software services' data practices by regularly checking their privacy policies and any updates to their data handling procedures. This will help you remain aware of any changes that may impact your use of their models in the legal field.
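The first recommendation above can be partly automated. The following is a minimal sketch of a redaction step run before any text leaves your machine. The client names and the case-number pattern are hypothetical placeholders you would replace with your own, and a filter like this only catches what it has been told to look for, so it supplements rather than replaces careful review.

```python
import re

# Hypothetical client names and case-number format; adjust for your matters.
CLIENT_NAMES = ["Acme Corp", "Jane Doe"]
CASE_NUMBER = re.compile(r"\b\d{2}-[A-Za-z]{2}-\d{4,6}\b")

def redact(prompt):
    """Replace known client names and case numbers with neutral placeholders."""
    for i, name in enumerate(CLIENT_NAMES, start=1):
        prompt = re.sub(re.escape(name), f"[CLIENT_{i}]", prompt, flags=re.IGNORECASE)
    return CASE_NUMBER.sub("[CASE_NO]", prompt)

original = "Summarize the indemnity clause for Acme Corp in case 23-cv-01234."
safe = redact(original)

print(safe)
```

The redacted text, not the original, is what you would paste or send to the software service.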

By considering these recommendations and implementing appropriate safeguards, legal professionals can confidently leverage the benefits of OpenAI and other software services while protecting sensitive information, upholding attorney-client privilege, and ensuring confidentiality.

In the next section, we will conclude our discussion on OpenAI data practices and their implications for the legal field and provide a preview of the follow-up article on legal ethics considerations.

VII. Conclusion

In this article, we have explored the fascinating world of Natural Language Processing models, delved into OpenAI's data collection and storage practices, and compared these practices with other popular platforms in the legal field. We have also addressed the legal and ethical concerns surrounding using OpenAI's models and provided recommendations for mitigating potential risks.

We have discussed how OpenAI's data practices align with industry standards and share similarities with other platforms legal professionals commonly use. By understanding the underlying technology and implementing appropriate safeguards, lawyers can confidently leverage the power of OpenAI's models to enhance their work while protecting sensitive information and upholding attorney-client privilege and confidentiality.

I should clarify that I am not endorsing or condemning any of the privacy practices listed above. I am trying to move the legal profession forward in its understanding of technology.

As we navigate the ever-evolving landscape of law and technology, legal professionals must stay informed and adapt to new developments. In our follow-up article, we will dive deeper into the state of the law today and analyze the legal ethics considerations surrounding using OpenAI's models and other advanced technologies in the legal field.

Looking for a GPT alternative with better confidentiality practices and tailor-made for lawyers?

Check out QuantumQuill, by Promise Legal Tech

Side-by-side comparisons of OpenAI and other tech

OpenAI Technologies

OpenAI Subprocessors
Subprocessor Name      Purpose                                            Location
Microsoft Corporation  Cloud infrastructure                               Worldwide
OpenAI affiliates      Services and support                               United States
Snowflake              Data warehousing                                   United States
TaskUS                 Human annotation of data for service improvement   Worldwide

ChatGPT Wappalyzer Results (Technology Stack)
Type Resource
Static Site Generator
Issue Trackers
Programming Languages
UI Frameworks
JavaScript Frameworks
Web Servers
Google Analytics
JavaScript Libraries
Cloudflare Bot Management
Cloudflare Turnstile
Open Graph

Westlaw Technologies

Westlaw Practical Law Subprocessors
(UK list because US-based list was not found)
Name                       Address
Microsoft Corporation      One Microsoft Way, Redmond, WA 98052, USA
Refinitiv                  30 South Colonnade-Canary Wharf, London E14 5EP, United Kingdom
Amazon Web Services, Inc.  410 Terry Avenue North, Seattle, WA 98109-5210, USA
EPAM Systems, Inc.         41 University Drive, Suite 202, Newtown, PA 18940, USA
Adestra Ltd                Holywell House, Osney Mead, Oxford, OX2 0ES, United Kingdom, Inc. The Landmark @ One Market, Suite 300, San Francisco, CA 94104, USA
SAP America, Inc.          3999 West Chest Pike, Newtown Square PA 19073, USA

Westlaw Wappalyzer Results (Technology Stack)
Type Resource
JavaScript Graphics
UI Frameworks
JavaScript Frameworks
Twitter Ads
Google Ads
Google Analytics
Facebook Pixel
Linkedin Insight Tag
Google Ads Conversion Tracking
Adobe Analytics
Customer Data Platform
Adobe Experience Platform Identity Service
JavaScript Libraries
jQuery UI
Live Chat
Marketing Automation
Adobe Audience Manager
Tag Managers
Adobe Experience Platform Launch
Open Graph