Sunday, March 8, 2026

AI Training Data Questions: What Nepali Businesses Should Know Before Scraping Content

AI tools are now part of everyday business. Nepali companies use them for marketing copy, product descriptions, customer support, analytics, software development, and internal knowledge systems. But before a business starts scraping websites, articles, images, code repositories, or databases to train an internal model or build an AI-powered product, it needs to confront a hard legal reality:

Nepal does not yet have a clear, AI-specific rulebook for training data.

That does not mean scraping is risk-free. It means the risk must be managed through existing copyright principles, contract rules, platform terms, confidentiality obligations, and general digital-law exposure. Nepal’s copyright framework still centers on human-created works and does not expressly regulate AI training data, a gap that current commentary and recent scholarship repeatedly note. (Common Law Chambers)

For founders, content businesses, SaaS companies, media houses, and agencies in Nepal, the key question is not only “Can we scrape it?” The smarter question is “What legal risks are we taking on if we do?”




Why This Issue Matters for Businesses in Nepal

Training data is not just a technical input. It is a legal exposure point.

If a business scrapes third-party articles, product photos, videos, user reviews, source code, or content from closed communities, the risks can include copyright claims, breach of website terms, complaints from rights holders, reputational fallout, and disputes with customers or investors over the provenance of the model’s outputs. Recent Nepali legal commentary and academic writing both emphasize that Nepal lacks clear statutory rules for AI training data, ownership, and liability. Businesses are therefore operating in an uncertain environment, not a permission-free one.

This matters even more if your business is:

  • building a fine-tuned model using scraped Nepali-language content

  • ingesting client data into external AI systems

  • training an internal chatbot on third-party material

  • scraping competitors’ websites, catalogs, or blogs

  • using scraped code or documentation for AI-assisted development


The Current Legal Position in Nepal

Nepal’s Copyright Act, 2059 (2002) remains the main legal reference point for copyrighted material, and the Copyright Registrar’s Office continues to publish the Act and registration materials. But the statute does not currently set out a dedicated framework for AI model training, text-and-data mining, or machine-learning scraping. Recent Nepal-focused legal commentary and academic research both describe this as a major gap in the law. 

In practical terms, that means several things are true at once:

1. Nepal does not clearly authorize AI training on copyrighted works

There is no explicit Nepali equivalent of a broad text-and-data-mining exception for AI training in the materials cited above. The legal silence creates uncertainty, not automatic permission. 

2. Nepal also does not clearly prohibit every act of scraping in one single AI-specific rule

Instead, businesses have to assess risk under existing copyright law, contract principles, platform rules, and other applicable digital laws. 

3. Human authorship still matters in Nepal’s copyright system

Recent Nepal-focused commentary and scholarship note that the Copyright Act is built around human authorship and does not clearly define AI authorship or ownership. That affects both outputs and the legal analysis around training practices.

For businesses, the takeaway is simple: do not treat legal silence as a green light.


Scraping vs Copyright: The Two Questions Businesses Must Separate

A lot of companies collapse everything into one issue. That is a mistake. There are usually two separate legal questions.

First question: Was collecting the material lawful?

This may involve:

  • website terms of use

  • technical access restrictions

  • contractual limits

  • confidentiality obligations

  • privacy or sector-specific compliance

  • possible digital-law exposure

For example, some websites expressly ban automated extraction or scraping in their terms. One Nepal-based publisher’s published terms prohibit automated content extraction and commercial reuse without written permission. That does not create a universal rule for all sites, but it is a strong reminder that scraping can trigger contractual issues even before copyright analysis begins. 
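One technical signal that can be checked alongside the written terms is a site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to test whether a path is open to automated access. The robots.txt content, the bot name, and the URLs are hypothetical and inlined so the example runs offline. Passing this check is not legal permission: it does not replace reading the terms of use or obtaining a license.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /premium/
Allow: /
"""


def is_path_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "example-research-bot" and example.com are placeholders, not real services.
for url in (
    "https://example.com/blog/post",
    "https://example.com/premium/report",
):
    allowed = is_path_allowed(SAMPLE_ROBOTS_TXT, "example-research-bot", url)
    print(url, "allowed" if allowed else "disallowed")
```

With these sample rules, the blog URL is reported as allowed and the premium URL as disallowed. A real workflow would record the result next to the terms-of-use review, not treat it as a decision on its own.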

Second question: Was using the material for AI training lawful?

That is the copyright and licensing issue. Even if content is publicly accessible, that does not automatically mean it is free to copy into datasets, tokenize into training corpora, or use for commercial model development. Nepal has not clearly answered this yet. A business can therefore face risk on both fronts at the same time.


The Main Risk Categories Nepali Businesses Should Assess

Copyright risk

If scraped materials include protected articles, books, images, music, videos, product descriptions, or software code, training on them may raise infringement questions, especially if the resulting system reproduces the original works, summarizes them too closely, or generates outputs that resemble them. Nepal-focused legal commentary specifically flags the lack of clear rules on AI training data and the possibility that AI systems may use copyrighted materials without permission.

This is especially sensitive for:

  • news content

  • educational materials

  • stock photography

  • design assets

  • premium databases

  • code repositories

  • subscription content

Terms-of-use and contract risk

A site’s terms may prohibit scraping, automated extraction, commercial reuse, or republication. That means your issue may become contractual even if the copyright question stays unsettled. Published website terms in Nepal already show this pattern. 

Confidentiality and trade secret risk

If your team scrapes or uploads non-public client documents, internal manuals, partner materials, or paywalled content into an AI workflow, you may also create trade secret or confidentiality exposure. This risk is often overlooked.

Output risk

Even when the training set is mixed and the legal theory is uncertain, the real-world dispute often starts at the output level. If your tool generates text, images, or code that substantially resembles someone else’s work, your business may be the one receiving the complaint.

Investor and due diligence risk

Sophisticated investors increasingly ask where training data came from, whether licenses were obtained, whether the model was built on open or closed materials, and whether any takedown or infringement disputes exist. A company with weak data provenance often becomes harder to fund, acquire, or partner with.


Publicly Available Does Not Mean Free to Train On

This is one of the biggest misunderstandings.

Businesses often assume that if content is visible on the open web, it is safe to scrape for AI. That assumption goes too far. Public availability and legal permission are not the same thing. Courts in other countries are still dealing with disputes over whether publicly available content can be used for training without a license. Reuters reported, for example, on litigation in India involving allegations of scraping and disputes over whether publicly available content may be used for AI training.

That dispute is not Nepal law. But it is highly relevant because it shows the exact controversy Nepali businesses may walk into if they treat public access as blanket authorization.


Special Risk Areas by Content Type

1. Scraping articles, blogs, and news content

This is high-risk because news and long-form text are clearly copyrightable in their expression, and many publishers also impose terms on reuse and automated extraction. Nepal-focused commentary already warns that the law does not clarify AI training use of copyrighted works. If you are building:

  • a Nepali-language LLM

  • a media-monitoring AI tool

  • a knowledge base trained on publisher content

  • a chatbot trained on competitors’ blogs

you should assume meaningful legal exposure unless you have a license or a very strong legal basis.

2. Scraping images and design assets

Images create a double problem. They are highly protectable works, and they are also easy to match visually in outputs. If your business uses scraped images to train a design generator or product-visualization model, the downstream similarity risk is substantial.

3. Scraping code

Code can be protected by copyright, and repositories may also be governed by open-source licenses or platform terms. If your company trains on code, you must assess both copyright and license compatibility. This matters particularly for SaaS startups and agencies using AI coding assistants in product development.

For broader software-rights issues, see Copyright for Software in Nepal: Who Owns the Code — Founders, Employees, or Freelancers?

4. Scraping user-generated content

Reviews, comments, community posts, and forum material may look casual, but they may still be protected, and they can also contain personal data, confidential information, or platform-specific restrictions.

5. Scraping paywalled, subscription, or database content

This is among the highest-risk categories. The commercial nature of the source, the likely contractual restrictions, and the evidentiary clarity around unauthorized extraction all make disputes more likely.


What About Fair Dealing or Public Interest?

Businesses sometimes ask whether educational or news-style exceptions could justify training.

At present, Nepal’s fair-dealing-style exceptions are not a safe basis for broad commercial AI scraping. Those exceptions are traditionally framed around limited socially useful uses such as education, news reporting, research, criticism, and commentary, not mass ingestion of copyrighted content into commercial AI systems. Since Nepal has not yet created a clear AI-training exception, companies should be very cautious about stretching those doctrines too far. The current scholarship and commentary highlighting the legal gap support that caution. 


A Practical Risk Framework Before You Scrape

Before scraping anything, a Nepali business should work through five questions.

1. What exactly are we collecting?

Make a clear list of content categories:

  • text

  • images

  • video

  • audio

  • code

  • metadata

  • user content

  • client content

2. Who owns or controls it?

Is it:

  • your own material

  • licensed material

  • open-license material

  • public-domain material

  • third-party copyrighted content

  • content behind terms or paywalls

3. What do the source terms say?

Check:

  • scraping bans

  • commercial-use limits

  • API restrictions

  • attribution requirements

  • data-mining clauses

  • reproduction prohibitions

4. What is the intended AI use?

There is a big difference between:

  • search indexing

  • internal analytics

  • summarization

  • fine-tuning

  • foundation-model training

  • commercial model resale

The closer the use moves toward large-scale, commercial model training, the more important the legal review becomes.

5. Can we defend provenance later?

If an investor, regulator, court, customer, or rights holder asks where the dataset came from, can you answer clearly?

If the answer is no, you have a governance problem.
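The five questions above can be captured as a simple screening record for each data source. The following is a minimal sketch, not a legal test: the class names, field names, and the set of uses treated as needing closer review are all assumptions made for illustration, and a real checklist should be agreed with counsel.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Ownership(Enum):
    OWN = "own material"
    LICENSED = "licensed material"
    OPEN_LICENSE = "open-license material"
    PUBLIC_DOMAIN = "public-domain material"
    THIRD_PARTY = "third-party copyrighted content"
    RESTRICTED = "content behind terms or paywalls"


# Assumption for illustration: uses that this sketch routes to closer legal review.
HIGH_SCRUTINY_USES = {"fine-tuning", "foundation-model training", "commercial model resale"}


@dataclass
class SourceReview:
    name: str
    content_types: List[str]      # question 1: what exactly are we collecting?
    ownership: Ownership          # question 2: who owns or controls it?
    terms_ban_scraping: bool      # question 3: what do the source terms say?
    intended_use: str             # question 4: what is the intended AI use?
    provenance_documented: bool   # question 5: can we defend provenance later?


def review_flags(review: SourceReview) -> List[str]:
    """Return the open issues that should be resolved before ingestion."""
    issues = []
    if review.ownership in (Ownership.THIRD_PARTY, Ownership.RESTRICTED):
        issues.append("third-party or restricted content: confirm license or legal basis")
    if review.terms_ban_scraping:
        issues.append("source terms restrict scraping or reuse")
    if review.intended_use in HIGH_SCRUTINY_USES:
        issues.append("intended use needs legal review before dataset assembly")
    if not review.provenance_documented:
        issues.append("provenance is not documented")
    return issues


# Hypothetical example source.
example = SourceReview(
    name="competitor blog archive",
    content_types=["text", "images"],
    ownership=Ownership.THIRD_PARTY,
    terms_ban_scraping=True,
    intended_use="fine-tuning",
    provenance_documented=False,
)
for issue in review_flags(example):
    print("REVIEW:", issue)
```

An empty list from `review_flags` means only that none of these illustrative checks fired. It is a prompt for review, not a clearance.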


Safer Alternatives to Raw Scraping

Many businesses do not actually need aggressive scraping.

Safer routes often include:

Licensing content directly

This is slower upfront but far cleaner legally.

Using first-party data

Your own documents, customer support logs, internal knowledge bases, and owned media are usually the best starting point, subject to privacy and confidentiality controls.

Using materials with clear open licenses

This still requires reading the license terms carefully, especially for commercial use and derivatives.

Using APIs or official feeds

Where available, APIs often reduce terms-of-use disputes compared with unsanctioned scraping.

Building retrieval systems instead of training on everything

Sometimes a retrieval layer is legally and commercially safer than using third-party content for model training.


Contractual Safeguards Businesses Should Use

If your business is buying AI services, outsourcing model work, or building with vendors, contracts matter a lot.

Your agreements should address:

  • training data provenance

  • warranties that the vendor has rights to use the data

  • indemnity for IP claims

  • restrictions on using your confidential data for model training

  • output ownership and reuse terms

  • record-keeping and audit rights

Recent Nepal-focused compliance commentary also recommends provenance, liability, and local enforcement protections in AI procurement and deployment contracts. 

For related guidance, see Using AI Tools in Business: How to Avoid IP Liability (Images, Text, Code) and AI-Generated Content and Copyright in Nepal: Who Owns It Right Now?


Practical Advice for Businesses

If you are a Nepali business thinking about scraping content for AI training, these are the most practical rules to follow right now.

Do not start with the assumption that public content is free content.
Do not scrape first and ask legal questions later.
Do not rely on Nepal’s current legal silence as if it were a defense.
Do not mix client-confidential materials into external AI tools without express review.

Instead:

  • map every data source before ingestion

  • review site terms and access restrictions

  • prioritize owned, licensed, or clearly open materials

  • keep a written dataset provenance log

  • document how your AI system is trained or grounded

  • build human review into high-risk outputs

  • use contracts to push risk back to vendors where appropriate
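A written dataset provenance log can be as simple as one JSON line per ingested item. The sketch below is one possible shape, assuming a JSON Lines file. The field names, the example URL, and the license note are hypothetical and should be adapted to whatever your counsel and auditors need to see.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_entry(source_url: str, license_note: str, content: bytes, collected_by: str) -> dict:
    """Build one log record describing where an item came from and on what basis."""
    return {
        "source_url": source_url,
        "license_note": license_note,                     # e.g. licence reference or written permission
        "sha256": hashlib.sha256(content).hexdigest(),    # fingerprint of exactly what was ingested
        "size_bytes": len(content),
        "collected_by": collected_by,
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }


def to_jsonl_line(entry: dict) -> str:
    """Serialize a record as a single JSON line, keeping Nepali text readable."""
    return json.dumps(entry, ensure_ascii=False)


def append_entry(log_path: str, entry: dict) -> None:
    """Append a record to the log file; existing lines are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(to_jsonl_line(entry) + "\n")


# Hypothetical example item.
entry = provenance_entry(
    source_url="https://example.com/articles/sample",
    license_note="written permission from publisher, reference on file",
    content="नमस्ते, sample article text".encode("utf-8"),
    collected_by="data-team",
)
print(to_jsonl_line(entry))
```

An append-only file with a content hash makes it easier to show later which version of an item was ingested and on what basis, which is the question investors and rights holders tend to ask.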

For businesses building serious AI products, legal review should happen before dataset assembly, not after launch.

Axcel Law advises Nepali startups, media businesses, and technology companies on copyright, contracts, licensing, and AI-related IP risk.


Conclusion

Nepali businesses are right to explore AI. But scraping content for training is not just a technical shortcut. It is a legal decision with real intellectual-property consequences.

Right now, Nepal does not offer a clear statutory safe harbor for AI training data. The Copyright Act remains in force, the Copyright Registrar’s Office continues to administer the existing regime, and Nepal-focused scholarship consistently says the law has not yet caught up to AI training, authorship, and liability questions.

That means the prudent business approach is not blind confidence. It is controlled risk management.

If your company wants to build with AI, the safest path is to know your data sources, respect licenses and terms, reduce dependency on questionable scraping, and structure your contracts and workflows so you can defend your choices later. In this area, good compliance is not bureaucracy. It is product protection.
