2026-04-05 · Don Ho · 1200 words

GitHub Just Made Your Code Microsoft's Training Data. You Have 19 Days to Stop It.

On March 25, GitHub announced that starting April 24, all interaction data from Copilot Free, Pro, and Pro+ users will be used to train Microsoft's AI models. You are opted in by default. If you do nothing, every code snippet you send to Copilot, every suggestion you accept, and every file name and piece of repository structure the tool touches during a session becomes training data for Microsoft's CoreAI strategy. Copilot Business and Enterprise users are excluded. Everyone else is fair game.

GitHub's Chief Product Officer Mario Rodriguez framed the change as necessary to improve model performance. That may be true. It is also true that Microsoft just converted 40 million developers into an unpaid data pipeline, and the opt-out is buried in a settings page that isn't even accessible from the mobile app.

What GitHub Is Actually Collecting

The scope is broader than most developers realize. When the training data setting is enabled (which it is, by default), GitHub collects:

- accepted or modified outputs from Copilot suggestions
- inputs and code snippets sent to Copilot
- code context surrounding the cursor position
- comments and documentation in the active file
- file names and repository structure
- navigation patterns within the project
- interactions with Copilot features, including chat and inline suggestions
- thumbs up/down feedback on suggestions

GitHub draws a distinction between code "at rest" (stored in your repository, which they say they don't access for training) and code "in session" (actively sent to Copilot while you're working). That distinction sounds reassuring until you think about what "in session" actually means. If you're using Copilot regularly, every active file in every repository you work in during a Copilot session is potentially in scope. The model sees your proprietary architecture, your naming conventions, your domain-specific patterns, and your business logic.

The collected data may also be shared with "GitHub affiliates," defined as companies in the same corporate family. That means Microsoft and its subsidiaries. Third-party model providers are excluded from receiving this data for their own training, but that limitation applies to external partners, not to Microsoft itself.

The IP Problem Nobody's Talking About

Here is the issue that should keep every general counsel awake tonight.

Individual Copilot users within an organization typically do not have the authority to license their employer's source code to a third party. That is a basic principle of IP ownership. If you write code as part of your employment, your employer owns that code (in most jurisdictions, under work-for-hire doctrine). You cannot unilaterally grant Microsoft a license to use it for training data.

Yet GitHub's opt-out mechanism is enforced at the individual user level, not the organization level. A single developer on your team who uses a personal Copilot Free or Pro account and doesn't toggle the setting off has potentially exposed your proprietary codebase to Microsoft's training pipeline.

GitHub's FAQ partially addresses this: interaction data from users whose accounts are members of a paid organization will be excluded from model training, and data from paid organization repositories is never used regardless of the user's tier. That sounds comprehensive. But it depends on every developer using their org-linked account for all work, never working on company code from a personal account, and never contributing to a company project from a personal Copilot session. In practice, those boundaries are porous. Developers work from personal machines. They test code locally. They use personal accounts for side projects that overlap with work.

The Competitive Intelligence Angle

One Reddit commenter put it plainly: "When you use Copilot, you're not just getting suggestions. You're implicitly teaching the model what good code looks like in your domain. Your proprietary patterns, architecture decisions, domain-specific idioms, naming conventions, all get folded into a general model. That model then improves suggestions for everyone else, including your direct competitors who use the same tool."

This is not paranoia. It is the business model working exactly as designed. Microsoft trains the model on your patterns, then sells better suggestions to your competitors. The data flows one direction: from your proprietary codebase into a general-purpose model that benefits everyone who pays for the subscription. The value extraction is the product.

GitHub acknowledges the dynamic indirectly by noting that Microsoft, Anthropic, and JetBrains take similar approaches to using interaction data for model training. The fact that the industry has converged on this practice does not make it acceptable. It makes it an industry-wide IP problem.

GDPR and International Exposure

For companies with European operations or employees, the GDPR question is immediate. GitHub claims "legitimate interest" as its lawful basis for processing interaction data. Under GDPR Article 6(1)(f), legitimate interest requires a balancing test: the controller's interest must not be overridden by the data subject's rights and freedoms.

Training commercial AI models on developers' proprietary code, without affirmative consent, to benefit the controller's competitive position is a difficult legitimate interest argument. The Article 29 Working Party (now the EDPB) has consistently held that legitimate interest does not apply when the processing is unexpected from the data subject's perspective or when a less intrusive alternative exists. An opt-in model is clearly less intrusive. GitHub chose opt-out. That choice will be tested.

What to Do Before April 24

For individual developers: go to github.com/settings/copilot/features. Under the "Privacy" section, disable "Allow GitHub to use my data for AI model training." Do this now. Do not wait.

For engineering leaders: audit every developer's Copilot tier and account configuration across your organization. Ensure every developer uses an org-linked account on a Copilot Business or Enterprise tier, which are excluded from training data collection. If any developer is using a personal Free, Pro, or Pro+ account for any company work, that is an immediate policy gap.
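One way to start that audit is to diff your org's membership against its org-managed Copilot seats: any member without a seat who writes company code may be relying on a personal-tier account. The sketch below assumes you have already pulled the two datasets from GitHub's REST API (`GET /orgs/{org}/members` and `GET /orgs/{org}/copilot/billing/seats`, both of which require an org admin token); the payload shapes shown here are simplified from the documented responses.

```python
# Org-level Copilot audit sketch. Assumes two datasets already fetched
# from the GitHub REST API with an org admin token:
#   GET /orgs/{org}/members                -> all org members
#   GET /orgs/{org}/copilot/billing/seats  -> org-managed Copilot seats
# A member with no org-managed seat who works on company code may be
# using a personal Free/Pro/Pro+ account -- the policy gap described above.
# Payload shapes below are simplified; consult the API docs for the full schema.

def find_unmanaged_members(org_members: list[dict],
                           copilot_seats: list[dict]) -> list[str]:
    """Return logins of org members who hold no org-managed Copilot seat."""
    seat_holders = {seat["assignee"]["login"] for seat in copilot_seats}
    return sorted(
        member["login"]
        for member in org_members
        if member["login"] not in seat_holders
    )

# Example with mocked API responses:
members = [{"login": "alice"}, {"login": "bob"}, {"login": "carol"}]
seats = [{"assignee": {"login": "alice"}}, {"assignee": {"login": "carol"}}]

print(find_unmanaged_members(members, seats))  # ['bob']
```

The output is a starting list for follow-up, not proof of a violation: a flagged developer may simply not use Copilot at all. The point is to know who falls outside the org-managed (and training-excluded) tiers before April 24, not after.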

For general counsel: update your acceptable use policies for AI coding tools. Require org-managed accounts for all coding AI tools. Prohibit use of personal-tier AI coding assistants on company code. Add Copilot data training settings to your quarterly IT compliance audit. If your company has European operations, assess GDPR exposure from any developer who has already been using Copilot without opting out.

For procurement teams: if you're evaluating or renewing Copilot licenses, the training data policy is now a negotiation point. Ask Microsoft directly: will our interaction data be used for model training under any tier, any circumstance, any edge case? Get the answer in writing. Put it in the contract.

The deadline is April 24. Nineteen days. Every day a developer on your team uses Copilot without opting out is another day of proprietary code flowing into Microsoft's training pipeline. The setting takes 30 seconds to change. The IP exposure from not changing it could last years.