Overview: The brief accomplishes two tasks. First, it explores the conceptual barriers that prevent full use of data by (i) market participants who want access to data without paying its creators, and (ii) creators who want to make their data freely available without ceding control of their data rights to others. Second, the authors propose a novel data licensing regime to address the shortcomings of current approaches.
The regime "focuses on contracts for accessing databases rather than recognizing specific legal statuses for databases." It is a good first step towards creating a rigorous regime for open source data usage licenses.
- Part (1) Introduces policy interest in unambiguous data licensing
- Part (2) Describes current licensing regimes
- Part (3) Describes the taxonomy underlying the MDL
- Part (4) Discusses caveats and exceptions
- Parts (5, 6, 7) Present Conclusions, References & Appendices.
Comment: The Montreal Data License (MDL) is a first iteration of an approach that tackles the problem of licensing data as a free good. The authors propose to standardize terminology and licensing terms so that data can be made available in a manner similar to the Free & Open Source Software (FOSS) licensing regime. The authors note that “While metadata can help reduce some of these costs, it often lacks the legal clarity that [is] needed to define how data can be used.” In this context, the authors propose the MDL to create economic conditions that “enable AI and machine learning (ML) growth that benefits everyone.”
There are two items of concern:
There is a claim that “progress made in ML and AI should also be reflected by licenses that reflect the iterative process that move fundamental research to commercially available products and solutions, akin to the foundational moments of license creation for FOSS;” and “For the benefits of ML and AI to be accessible and benefit a wider realm of humanity, other market participants need to be on a level playing field.”
This affects the value of the paper. It is reasonable to create “a more transparent, predictable market for data with clear legal language as its underpinning.” But the paper does not appear to consider the ripple effects of the proposed regime. For example:
- Issue #1 - Market Effects: This proposal appears to support the idea that markets ought to be manipulated to create a falsely-constructed "level playing field" that is designed to favour a single class of market participant (fundamental researchers). If this is what is being suggested, that might not be acceptable to other market participants.
- Issue #2 - Ecological/Emissions Effects of False Equivalency: FOSS data goods are not equivalent to FOSS software goods. FOSS data goods are more likely to have no owner, or lose their owner, and thereby contribute to the blight on the planet that is waste data [Note: OrbMB is developing a process to help cut data waste holdings - Ed.].
It would be prudent to develop supplementary licensing frameworks to address these issues. For example, the MDL FOSS-Data Regime interest group might want to consider the nature of custodianship: does a FOSS Data licensee accept legal responsibility for chain of custody, cost management, emissions cost management, and the duty to delete the asset copy?
===============================
(2019) Misha Benjamin et al., TOWARDS STANDARDIZATION OF DATA LICENSES: THE MONTREAL DATA LICENSE. Misha Benjamin 1; Paul Gagnon 1; Negar Rostamzadeh 1; Chris Pal 1,2,3; Yoshua Bengio 3,4,5; Alex Shee 1. Affiliations: 1 Element AI; 2 Polytechnique Montréal; 3 MILA; 4 Canada CIFAR AI Chair; 5 Senior CIFAR Fellow. arXiv:1903.12262v1, https://arxiv.org/abs/1903.12262 (Accessed Q3/4-2024)
Part (1) Introduction (p.1-2)
This paper introduces a taxonomy for data licensing in artificial intelligence (AI) and machine learning (ML). The aim is to create a standardized framework similar to open-source software licensing.
Drawing a parallel between data and oil markets, the paper highlights the critical role of data in powering AI and ML systems, and compares data acquisition and processing to the resource-intensive extraction, refining and delivery of oil. Unlike oil and gas (O&G) markets, the data market lacks standardization and regulation, which the authors suggest is needed to reduce friction, ensure security, and build public trust.
The authors argue that a new licensing regime will create “fairer and more efficient markets for data” because it will more clearly “define how data can be used in the fields of AI and ML.” A new family of data rights is organized as a new form of license called the Montreal Data License (MDL), and there is a web-based tool for generating these licenses.
While metadata can help reduce some of these costs, it often lacks the legal clarity needed to define how data can be used. The authors call for broader access to the benefits of AI, and claim that this requires a more transparent, predictable data market underpinned by clear legal language. The paper discusses the challenges of current data licensing terms and proposes a taxonomy better aligned with AI and ML. The taxonomy aims to clarify how data may be used and which rights attach to it, to give database creators a framework for generating clearer licensing terms, and to provide access to data for research purposes.
===============================
Part (2) Licensing barriers to use of data in ML and AI (p.3-5)
A review of commonly used databases in AI research (Appendix 1, Document page 12) reveals a "patchwork" of vague licensing terms that create uncertainty about the permissions granted for their use. This gives rise to barriers to usability such as:
(i) Lack of Nuance on “Use”: The right to use is granted without defining what “use” actually means; the resulting “one homogenous notion of use” causes downstream problems.
(ii) Commercial vs Non-Commercial Use: Many of the licenses cited in Appendix 1 contain a restriction against commercial use. In the authors' opinion, this restriction is problematically ambiguous.
(iii) Barriers to Research: Pure-play academic researchers are increasingly unable to access datasets because of ambiguity and cost.
(iv) Lack of Uniformity: Terminology is not uniform and standardized. Free and Open Source Software (FOSS) communities have built a software-sharing commons by constructing a standard terminology and rules for software use. Data might move to the same regime.
(v) Share-Alike Requirements: Certain datasets are made available with licensing terms that make them difficult to use in AI/ML. For example, “the notion of derivative work is ill defined” in the Creative Commons Share Alike license (CC-SA) regime.
(vi) Licensing Language Requires Standardization and Context-Appropriate Adaptations to ML and AI: [the use of this statement is unclear – see “Comment”, above].
The lack of clarity makes it difficult for database users to determine whether their intended use cases fall within the granted permissions, leading to unpredictability and the need for further analysis.
This in turn increases transaction costs, as resources, time, and expertise are required to assess whether a database can be used for a particular purpose. The paper aims to resolve these conceptual ambiguities using a constructed taxonomy that clarifies and standardizes data licensing terms.
===============================
Part (3) Taxonomy underlying the MDL (p.5-9)
The MDL approach is to create a Use Case Taxonomy, in which each Use Case is defined by the legal rights it grants to use data in ML and AI modeling. These are the definitions:
- Data: Raw Data, including metadata, which provides structural details about the data.
- Labelled Data: Data enriched with metadata labels or tags, which help models make sense of the data.
- Model: Refers to machine learning (ML) or AI algorithms used to derive insights or predictions.
- Untrained Model: A model that has not yet been exposed to data to optimize its parameters.
- Trained Model: A model that has been exposed to data, which has been used to train it.
- Representation: A transformed version of the data, which provides the means to use a model without touching the original data.
- Output: The results generated by applying a trained model to data.
- Research Use: This license allows the licensee to train models and create new datasets based on the data, with the restriction that any resulting models or datasets are subject to the same limitations. The license allows experimentation without commercializing the outcomes unless a separate paid license is negotiated.
- Publishing Research: This right permits the licensee to publish all research results, including those generated by models trained on the data, under the same restrictions as the original data. It clarifies the rights for both academic and private entities to advance the field of ML/AI and addresses ambiguity about commercialization rights in academic contexts.
- Internal Use: This license allows the licensee to train models and use them internally, for example to trade proprietary capital for profit, but restricts sharing or selling the predictions to third parties.
- Output Commercialization: With this right, the licensee can commercialize the output by offering a service like a stock prediction API or trading third-party capital, but cannot sell or distribute the model itself.
- Model Commercialization: This license allows the licensee to commercialize both the model and its output, including selling the model itself, for example by offering perpetual licenses that let customers modify and distribute the model.
- Designated Third Parties: Limiting data use to specific entities.
- Sub-licensing: Restricting sublicensing of the data to third parties, which prevents contractors from using the data.
- Attribution / Confidentiality: Licensors could require or prevent attribution.
- Ethical Considerations: Ethical clauses could limit the data's use in certain fields, such as healthcare or military applications, to address concerns over the impact of data use.
Application
The paper applies these definitions to a worked example: the treatment of a database of historical equities trades.
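
As an illustration only, the taxonomy above can be sketched as a small data structure: a license is a set of granted rights plus optional restrictions, and a proposed use is checked against it. This is a hypothetical sketch written for this summary, not the authors' license generator; the names Right, DataLicense, permits and excluded_fields are assumptions, and the sketch treats every right as independently granted rather than assuming that one right implies another.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Right(Enum):
    """Rights named in the taxonomy summary (identifiers are illustrative)."""
    RESEARCH_USE = auto()
    PUBLISHING_RESEARCH = auto()
    INTERNAL_USE = auto()
    OUTPUT_COMMERCIALIZATION = auto()
    MODEL_COMMERCIALIZATION = auto()


@dataclass(frozen=True)
class DataLicense:
    """Hypothetical representation of one generated license."""
    granted: frozenset                       # rights explicitly granted
    sublicensing_allowed: bool = False       # Sub-licensing restriction
    attribution_required: bool = False       # Attribution / Confidentiality
    excluded_fields: frozenset = frozenset() # Ethical Considerations

    def permits(self, right, field_of_use=None):
        """Return True only if the right is granted and the field is not excluded."""
        if field_of_use is not None and field_of_use in self.excluded_fields:
            return False
        return right in self.granted


# Assumed example: a historical equities database licensed for research
# and internal trading, with military applications excluded.
equities_license = DataLicense(
    granted=frozenset({Right.RESEARCH_USE, Right.INTERNAL_USE}),
    attribution_required=True,
    excluded_fields=frozenset({"military"}),
)

print(equities_license.permits(Right.INTERNAL_USE))              # True
print(equities_license.permits(Right.OUTPUT_COMMERCIALIZATION))  # False
print(equities_license.permits(Right.RESEARCH_USE, "military"))  # False
```

Under this assumed license, the licensee may trade its own capital using a trained model but may not offer a prediction API to third parties, because Output Commercialization was not granted.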
===============================
Part (4) Caveats and Examples (p.9-10)
Licensors may include additional restrictions; the authors note that data licensing for AI/ML could be subject to external rule-making and controls, such as:
- Legal Frameworks and Copyright: The underlying data may be subject to rights that are distinct from the licensing of use of the data.
- Ill-acquired Data: There may be issues such as violations of privacy and owners' rights.
- No Property Rights in Data: The paper does not propose that data should be treated as property with inherent rights.
- Database Rights vs. Copyright: "The paper distinguishes between specific database rights (e.g., the EU’s Database Directive) and copyright statutes (e.g., in Canada), noting that its framework focuses on contracts for accessing databases rather than recognizing specific legal statuses for databases."
===============================
Parts (5 & 6) Conclusion & References (p.10-11)
These sections outline the resources and tools provided in the paper to promote clearer data licensing for AI and ML, and include a list of references.
===============================
Appendices (p.12-16)
- Appendix 1: Overview of commonly used datasets
- Appendix 2: Summary of rights granted in conjunction with Models
- Appendix 3: Top Sheet for Licensed Rights
- Appendix 4: CC-BY4 Montreal Data License (MDL) attribution notice & cautions