01 December 2024

Towards Standardization of Data Licenses: The Montreal Data License, Benjamin et al (2019)

Overview: The brief accomplishes two tasks. First, it explores the underpinnings of the barriers that prevent full use of data by: (i) market participants who want access to data without paying the creators of that data; and (ii) creators who want to make their data freely available without losing control of the rights in it. Second, the authors propose a novel data licensing regime to address the shortcomings of current approaches.


The regime "focuses on contracts for accessing databases rather than recognizing specific legal statuses for databases." It is a good first step towards creating a rigorous regime for open source data usage licenses.





  • Part (1) Introduces policy interest in unambiguous data licensing
  • Part (2) Describes current licensing regimes
  • Part (3) Describes the taxonomy underlying the MDL
  • Part (4) Discusses caveats and exceptions
  • Parts (5, 6, 7) Present Conclusions, References & Appendices


Comment: The Montreal Data License (MDL) is a first iteration of an approach that tackles the problem of licensing data as a free good. The authors propose to standardize terminology and licensing standards, to make data available in a manner similar to the Free & Open Source Software (FOSS) licensing regime. The authors note that “While metadata can help reduce some of these costs, it often lacks the legal clarity needed to define how data can be used.” In this context, the authors propose the MDL to create economic conditions that “enable AI and machine learning (ML) growth that benefits everyone.”


There are two items of concern:


There is a claim that “progress made in ML and AI should also be reflected by licenses that reflect the iterative process that move fundamental research to commercially available products and solutions, akin to the foundational moments of license creation for FOSS;” and “For the benefits of ML and AI to be accessible and benefit a wider realm of humanity, other market participants need to be on a level playing field.” 

This affects the value of the paper. It is reasonable to create “a more transparent, predictable market for data with clear legal language as its underpinning.” But there does not seem to be any consideration of the ripple effects of the proposed regime. For example:

    • Issue #1 - Market Effects: This proposal appears to support the idea that markets ought to be manipulated to create a falsely-constructed "level playing field" that is designed to favour a single class of market participant (fundamental researchers). If this is what is being suggested, that might not be acceptable to other market participants.
    • Issue #2 - Ecological/Emissions Effects of False Equivalency: FOSS data goods are not equivalent to FOSS software goods. FOSS data goods are more likely to have no owner, or lose their owner, and thereby contribute to the blight on the planet that is waste data [Note: OrbMB is developing a process to help cut data waste holdings - Ed.]. 


It would be prudent to develop supplementary licensing frameworks to address these issues. For example: The MDL FOSS-Data Regime interest group might want to consider the nature of custodianship--does a FOSS Data licensee accept legal responsibility for chain-of-custody, cost management, emissions cost management, and duty-to-delete the asset copy? 

===============================

(2019) Misha Benjamin, Paul Gagnon, Negar Rostamzadeh, Chris Pal, Yoshua Bengio, and Alex Shee, TOWARDS STANDARDIZATION OF DATA LICENSES: THE MONTREAL DATA LICENSE (Element AI; Polytechnique Montréal; MILA; Canada CIFAR AI Chair; Senior CIFAR Fellow), arXiv:1903.12262v1, https://arxiv.org/abs/1903.12262 (accessed Q3/Q4 2024)

===============================

Part (1) Introduction (p.1-2)

This paper introduces a taxonomy for data licensing in artificial intelligence (AI) and machine learning (ML). The aim is to create a standardized framework similar to open-source software licensing.

Drawing a parallel between data and oil markets, the paper highlights the critical role of data in powering AI and ML systems, comparing data acquisition and processing to the resource-intensive business of oil extraction, refining, and delivery. Unlike oil and gas markets, however, the data market lacks standardized and regulated frameworks of the kind the authors argue are needed to reduce friction, ensure security, and build public trust.

The authors argue that a new licensing regime will create “fairer and more efficient markets for data,” as the new regime will more clearly “define how data can be used in the fields of AI and ML.” The “new family of data rights” is organized into a new form of license called the Montreal Data License (MDL), and a web-based tool is provided for generating these licenses.

While metadata can help reduce some of these costs, it often lacks the legal clarity needed to properly define how data can be used. The authors call for broader access to AI benefits, arguing that a more transparent, predictable data market with clear legal language is needed. The paper discusses the shortcomings of current data licensing terms and proposes a taxonomy better aligned with AI and ML, aiming to clarify the use of data and the associated rights, and giving database creators a framework for generating clearer licensing terms and for providing access to data for research purposes.

===============================

Part (2) Licensing barriers to use of data in ML and AI (p.3-5)

A review of commonly used databases in AI research (Appendix 1, Document page 12) reveals a "patchwork" of vague licensing terms that create uncertainty about the permissions granted for their use. This gives rise to barriers to usability such as:

(i) Lack of Nuance on “Use”: The right to use is granted without defining what “use” actually means; this creates “one homogenous notion of use,” which leads to downstream problems.

(ii) Commercial vs Non-Commercial Use: By way of example, many of the licenses cited in Appendix 1 contain a restriction against commercial use. In the authors' opinion, this restriction is problematically ambiguous.

(iii) Barriers to Research: Pure-play academic researchers are increasingly unable to access datasets because of ambiguity and cost.

(iv) Lack of Uniformity: Terminology is not uniform and standardized. Free and Open Source Software (FOSS) communities have built a software-sharing commons by constructing standard terminology and rules for software use; data licensing could move toward the same kind of regime.

(v) Share-Alike Requirements: Certain datasets are made available with licensing terms that make them difficult to use in AI/ML. For example, “the notion of derivative work is ill defined” in the Creative Commons Share Alike license (CC-SA) regime.

(vi) Licensing Language Requires Standardization and Context-Appropriate Adaptations to ML and AI: [the use of this statement is unclear – see “Comment”, above]. 


This lack of clarity makes it difficult for database users to determine whether their intended use cases fall within the granted permissions, leading to unpredictability and the need for further analysis.

This in turn increases transaction costs, as resources, time, and expertise are required to assess whether a database can be used for a particular purpose. The paper aims to resolve these conceptual ambiguities using a constructed taxonomy that clarifies and standardizes data licensing terms.

===============================

Part (3) Taxonomy underlying the MDL (p.5-9)

The MDL approach is to create a Use Case Taxonomy, where each Use Case is a framework that uses defined legal rights to govern the use of data in ML and AI modeling. These are the definitions (a minimal illustrative encoding follows the list):

  • Data: Raw Data, including metadata, which provides structural details about the data.
  • Labelled Data: Data enriched with metadata labels or tags, which help models make sense of the data.
  • Model: Refers to machine learning (ML) or AI algorithms used to derive insights or predictions.
  • Untrained Model: A model that has not yet been exposed to data to optimize its parameters.
  • Trained Model: A model whose parameters have been optimized by exposure to the data.
  • Representation: A transformed version of the data that provides the means to use the model without touching the original data.
  • Output: The results generated by applying a trained model to data.
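
As a minimal sketch (not from the paper), the artefact types above can be encoded as a small Python enum; the class name, member names, and comments are illustrative assumptions only.

    from enum import Enum, auto

    class MDLArtifact(Enum):
        """Artefact types named in the MDL taxonomy (labels are illustrative)."""
        DATA = auto()             # raw data, including structural metadata
        LABELLED_DATA = auto()    # data enriched with labels or tags
        UNTRAINED_MODEL = auto()  # model whose parameters have not been fit to the data
        TRAINED_MODEL = auto()    # model whose parameters were optimized on the data
        REPRESENTATION = auto()   # transformed form of the data, usable without the original
        OUTPUT = auto()           # results of applying a trained model to data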

Application to a "Market Trading" use case
  1. Evaluating Models: With this license, the licensee can train and test various versions of a model using the data to assess performance, but cannot use the output for stock trading or modify the model’s structure (e.g., keeping the trained model's weights).
  2. Research Use: This license allows the licensee to train models and create new datasets based on the data, with the restriction that any resulting models or datasets are subject to the same limitations. The license allows experimentation without commercializing the outcomes unless a separate paid license is negotiated.
  3. Publishing Research: This right permits the licensee to publish all research results, including those generated by models trained on the data, under the same restrictions as the original data. It clarifies the rights for both academic and private entities to advance the field of ML/AI and addresses ambiguity about commercialization rights in academic contexts.
  4. Internal Use: This license allows the licensee to train models and use them for trading proprietary capital for profit, but restricts sharing or selling the predictions to third parties.
  5. Output Commercialization: With this right, the licensee can commercialize the output by offering a service like a stock prediction API or trading third-party capital, but cannot sell or distribute the model itself.
  6. Model Commercialization: This license allows the licensee to both commercialize the model and its output, including selling the model itself, such as offering perpetual licenses for customers to modify and distribute the model.

Possible Restrictions
  • Designated Third Parties: Limiting data use to specific entities.
  • Sub-licensing: Restricting sublicensing of the data to third parties, which can prevent contractors from using the data.
  • Attribution / Confidentiality: Licensors could require or prevent attribution.
  • Ethical Considerations: Ethical clauses could limit the data's use in certain fields, such as healthcare or military applications, to address concerns over the impact of data use.
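
To illustrate how such a grant might look in machine-readable form, here is a rough, hypothetical sketch (not part of the MDL itself) that records the six use-case rights and a few of the restrictions above, and answers a simple "is this use permitted?" query. All field names, right names, and example values are invented for illustration.

    from dataclasses import dataclass, field
    from typing import Optional, Set

    # The six MDL-style use-case rights summarized above (names are illustrative).
    RIGHTS = {
        "evaluate_models", "research_use", "publish_research",
        "internal_use", "output_commercialization", "model_commercialization",
    }

    @dataclass
    class DataLicense:
        """Hypothetical machine-readable summary of an MDL-style grant."""
        granted_rights: Set[str] = field(default_factory=set)      # subset of RIGHTS
        designated_third_parties: Optional[Set[str]] = None        # None = no restriction
        sublicensing_allowed: bool = False
        attribution_required: bool = True
        ethical_exclusions: Set[str] = field(default_factory=set)  # e.g. {"military"}

        def permits(self, right: str, licensee: str, domain: str) -> bool:
            """Check a proposed use against the granted rights and restrictions."""
            if right not in RIGHTS or right not in self.granted_rights:
                return False
            if self.designated_third_parties is not None and licensee not in self.designated_third_parties:
                return False
            if domain in self.ethical_exclusions:
                return False
            return True

    # Example: a research-only grant to a named lab, excluding military applications.
    lic = DataLicense(
        granted_rights={"research_use", "publish_research"},
        designated_third_parties={"ExampleLab"},
        ethical_exclusions={"military"},
    )
    print(lic.permits("research_use", "ExampleLab", "healthcare"))            # True
    print(lic.permits("output_commercialization", "ExampleLab", "finance"))   # False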


Application 

The definitions above are applied to illustrate the treatment of a database of historical equities trades (the "Market Trading" use case).

===============================

Part (4) Caveats and Examples (p.9-10)

Licensors may include additional restrictions; the authors note that data licensing for AI/ML could be subject to external rule-making and controls, such as:

  • Legal Frameworks and Copyright: The underlying data may be subject to rights that are distinct from the licensing of use of the data.
  • Ill-acquired Data: There may be issues such as violations of privacy and owners' rights. 
  • No Property Rights in Data: The paper does not propose that data should be treated as property with inherent rights.
  • Database Rights vs. Copyright: "The paper distinguishes between specific database rights (e.g., the EU’s Database Directive) and copyright statutes (e.g., in Canada), noting that its framework focuses on contracts for accessing databases rather than recognizing specific legal statuses for databases." 

===============================

Parts (5 & 6) Conclusion & References (p.10-11)

This section outlines the resources and tools provided in the paper to promote clearer data licensing for AI and ML; and a list of references is included.

===============================

Appendices (p.12-16)

  • Appendix 1: Overview of commonly used datasets
  • Appendix 2: Summary of rights granted in conjunction with Models
  • Appendix 3: Top Sheet for Licensed Rights
  • Appendix 4: CC-BY4 Montreal Data License (MDL) attribution notice & cautions

09 November 2024

Policy Brief: What is the Value of Data? - Coyle & Manley (2022)

Overview: This brief explores various methods for valuing data, acknowledging limitations in existing models and emphasizing the need for more comprehensive approaches. The Typology of valuation methods in Part (2) below (and page 4 in brief) is especially helpful.

Comment: The global economy has been reshaped by data, with data-driven firms becoming dominant market leaders. This transformation extends to both private and public sectors. Although data’s value is widely acknowledged, there remains no consensus on how to quantify that value. This lack of clarity hinders optimal investments and governance.

This report builds on prior work (reviewed earlier in this blog - Ed) at the Bennett Institute for Public Policy (Coyle et al 2020). This article covers development of new methods of data value measurement.

Coyle, D. and A. Manley, Policy Brief: What is the Value of Data? A review of empirical methods, Bennett Institute for Public Policy, University of Cambridge, July 2022: https://www.bennettinstitute.cam.ac.uk/wp-content/uploads/2022/07/policy-brief_what-is-the-value-of-data.pdf

  • Part (1) Discusses the value of economically valuable data to public and private sectors
  • Part (2) Reviews proposed data valuation methodologies
  • Part (3) Presents a framework to develop an estimate of value

===============================

Part (1) Introduction

In recent years, data has become a key driver of economic transformation, with data-driven companies making up seven of the top 10 firms globally by market capitalization in 2021. This shift is particularly evident in the growing productivity and profitability gap between data-intensive firms and others.

Data’s value is increasingly recognized across sectors, including the public sector. Despite this recognition, no consensus has developed on how to empirically measure the value of data, which hinders realizing its full potential. While many firms and investors acknowledge the value of data, particularly through data services and stock market valuations, the absence of clear valuation methods makes it difficult to guide investment decisions or govern data usage effectively. Coyle & Manley's report explores various approaches to data valuation and highlights the challenges of incorporating factors like opportunity costs, risks, and the costs associated with data collection and storage into such assessments.

===============================

Part (2) Proposed Data Valuation Methodologies

Coyle (2020) presented the Lens framework, describing data through the Economic Lens and the Information Lens. 

Building on prior research, the report identifies key characteristics influencing data's value and outlines valuation approaches. Traditional methods—cost-based, income-based, and market-based—are commonly used but fail to fully account for other inputs. 

The report also highlights newer approaches which improve capture of data's broader economic value, such as ascribing value using data flows and marketplaces. Comparative methods are cited, including:

• Internet of Water: A taxonomy of various data valuation methods (2018).

• Ker and Mazzini (2020) identified four different methods: (a) cost-based; (b) income-based; (c) market capitalisation; and (d) trade flows.

• OECD Going Digital Toolkit: Estimates the value of data, summarizes the System of National Accounts cost-based frameworks being adopted by governments, and summarizes other approaches.

The methods are summarized as:

  • 2.1 Cost-based Methods
  • 2.2 Income-based Methods
  • 2.3 Market-based methods (Marketplaces, Market capitalisation, Data Flows) 
  • 2.4 "Ambiguity-driven" methods (term coined by blogger - Ed)
  • 2.5 Impact-Based Methods

-----------------------------

2.1 Cost-based Methods

These methods calculate the costs of generating, storing, and replacing data, providing a lower-bound estimate of its value. Variants like the Modified Historical Cost Method (MHCM) adjust for data characteristics, and the consumption-based method reflects usage rates. National statistical offices, such as Statistics Canada and the UK Office for National Statistics, have trialed this method. Cost-based methods are widely used for valuing data, rooted in the System of National Accounts (SNA) (1).
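
For intuition only, here is a toy sketch of a cost-based (lower-bound) estimate in the spirit of the MHCM; the cost categories, adjustment factors, and dollar figures are invented assumptions, not numbers from the brief.

    def cost_based_value(generation_cost, storage_cost, quality_factor=1.0, usage_factor=1.0):
        """Toy lower-bound estimate: historical costs adjusted MHCM-style.

        quality_factor stands in for adjustments for data characteristics,
        usage_factor for a consumption-based adjustment; both are illustrative.
        """
        return (generation_cost + storage_cost) * quality_factor * usage_factor

    # Illustrative numbers only (in dollars): $250k to generate, $40k to store.
    print(cost_based_value(250_000, 40_000, quality_factor=0.9, usage_factor=0.8))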

[Government interest is to define the value to generate taxes - Ed]. Their challenge is that "national level cost-based approaches rely on having well-classified data at the microlevel. This will be difficult to achieve and there are several blurred lines that make classification harder." (p.5-7) 

2.2 Income-based Methods

These methods estimate data's value through expected revenue streams generated by the data, such as selling marketing analytics. A common approach is the "relief from royalty" method, which estimates savings from owning data rather than licensing it. However, challenges arise in distinguishing data's contribution to revenue, especially for firms where data enhances products rather than being sold directly. This method also introduces uncertainty as it relies on judgment.
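
For intuition, here is a small worked sketch of a relief-from-royalty style calculation: the data asset is valued as the present value of the licence fees the owner avoids paying. The royalty rate, revenue forecast, and discount rate below are invented numbers, not figures from the brief.

    def relief_from_royalty(revenues, royalty_rate, discount_rate):
        """Present value of the royalties avoided by owning the data.

        revenues:      forecast revenues attributable to the data, per year
        royalty_rate:  hypothetical rate a licensee would pay (e.g. 0.03 = 3%)
        discount_rate: annual discount rate
        """
        return sum(
            (rev * royalty_rate) / (1 + discount_rate) ** (year + 1)
            for year, rev in enumerate(revenues)
        )

    # Illustrative numbers only: 5 years of revenue, 3% royalty, 10% discount rate.
    print(round(relief_from_royalty([10e6, 11e6, 12e6, 12e6, 12e6], 0.03, 0.10)))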

2.3 Market-based methods

These methods use observable prices for data, though such prices are rare since most data is used internally. When available, market prices offer valuable insights but reflect only a partial estimate of the broader social value of data. Key academic approaches include using data marketplaces, market capitalization of firms, and global data flows to estimate value. However, limitations remain, especially when data is aggregated or traded in complex ecosystems like credit scoring, where the true value often exceeds the sum of its parts.

2.3.1: Data Marketplaces

The literature on data marketplaces explores their potential to increase the value of data by reducing transaction costs, improving pricing transparency, and allowing multiple users to derive value from the same datasets. However, the success of such initiatives has been inconsistent, with key challenges including complex pricing mechanisms, regulatory differences, and a lack of trust. Data suppliers often bundle datasets and set prices based on consumer willingness to pay, but much of the literature remains theoretical and idealistic. Case studies from China, New Zealand, the EU, and Colombia show that effective data pricing and trust in data quality are critical for success, yet low participation often undermines marketplace efforts. Historical examples, such as Microsoft's failed Azure DataMarket, highlight the difficulty in building customer interest, while current platforms like the Shanghai Data Exchange and Ocean Market demonstrate varying approaches to data transaction and pricing. Ultimately, while data marketplaces hold significant potential, barriers such as trust, pricing, and regulation continue to limit their effectiveness.

2.3.2 Market capitalization-based

These methods are used to value data by examining its impact on a firm's market value, particularly for data-driven companies. This approach estimates the worth of these firms by looking at their overall market capitalization, which includes the value derived from data and analytics. For example, Ker and Mazzini (2020) estimate that U.S. data-driven firms, identified through lists like "The Cloud 100," are collectively worth over $5 trillion. Coyle and Li (2021) further build on this by analyzing how data-driven companies, such as Airbnb, disrupt traditional firms like Marriott, leading to the depreciation of incumbents' organizational capital. The decline in the value of non-digital firms' organizational capital due to data-driven competition helps estimate how much these firms should be willing to pay for data. Overall, these methods provide a way to quantify the value of data in relation to firm competitiveness and valuation in data-intensive industries.

2.3.3: Data Flows

Data flows are valuable in markets with limited information, as they can be observed and analyzed, with a strong correlation between the volume of data flow and its value on dominant online platforms. However, quantifying global data flows is challenging because most assessments only account for data that crosses international borders. While Ker and Mazzini (2020) and Coyle and Li (2021) suggest that the link between data flow volume and data value in specific locations is weak, due to large data hubs serving broad areas and the need for local knowledge, they highlight the economic significance of the content in data flows over volume alone. For example, video streaming generates more traffic than e-commerce but contributes less economic value. Their research also underscores the economic importance of digitally deliverable products, noting that many countries still lack a framework for categorizing digital trade. Overall, while data flows help understand data's economic value, their measurement is still evolving, hindered by definitional and geographical complexities. 

-----------------------------

2.4 "Ambiguity-driven" Methods

This section discusses Experiments and Surveys to estimate value where market prices do not exist. 

-----------------------------

2.5: Impact-Based Methods

The intent of data collection is to develop stories from which to mine insights for decision-making. Impact-based methods assess the value of data using cause-and-effect measurement. These create greater value-add, "making their value more persuasive than traditional quantitative approaches." These methods use testing, such as comparative scenarios, to measure response. Slotin (2018) reviews five data valuation methodologies, favoring impact-based approaches for their clarity and communicative strength. These include:

(a) Empirical Studies: Arrieta-Ibarra et al. (2020) estimated that "data use accounts for up to 47% of Uber’s revenue. In a scenario where drivers are fully compensated for their data, this could equate to $30 per driver per day for data generation."

(b) Decision-Based Valuation: A variant of empirical-based studies, this method "adjusts the value of data based on factors like frequency, accuracy, and quality before weighing outcomes by their contribution to decisions. This method acknowledges that value derives from improved decision-making and considers alternatives to using the data, although it requires subjective judgment."

(c) Shapley values: This variant sets aside the valuation of downstream insights. Instead, Shapley values represent a subset of impact-based methods that focus on valuing data in its raw form rather than solely on its applications in data-driven insights. Originating in game theory, Shapley values provide a unique payoff solution within a public good game, ensuring group rationality, fairness, and additivity. (A toy computation of Shapley values for hypothetical data providers appears after item (f) below.)

Shapley values are used in computer science to evaluate the contribution of individual data points to model performance, assess data quality, and optimize feature selection. They are also applied to determine compensation for data providers by quantifying the value of their data.  Although Shapley values provide a useful framework for valuing data, they represent only one possible solution to public good problems, with alternative approaches that might offer better properties. The method has advantages, such as identifying valuable data for collection, but also faces drawbacks, including high computational costs and challenges in translating value into monetary terms. 

(d) Direct measurable economic impact: Various studies analyze the growth and jobs impact.

(e) Stakeholder-based methods: These methods analyze the value to the sector supply chain of data availability: "This is a wider definition of value and may include value upstream or downstream omitted from other methods; it can encompass the non-rival aspect of data." The data consultancy Anmut has developed this method and provides a case study of its valuation of Highways England data (Anmut n.d.). The challenge of this approach is that it requires professional judgment of value rather than auditable measurements of value.

(f) Real options analysis: Real options analysis provides a method for estimating the value of data by considering its potential future use cases rather than its current applications. This flexibility enables firms to capitalize on positive opportunities while minimizing downside risks. Data is considered non-rival, meaning its value does not diminish with use, and its potential applications can remain undefined at the time of collection. The option value represents the "right but not the obligation" to generate insights from data in the future, allowing firms to collect data for unknown future purposes. The value here is that it rewards waiting: a firm can assess the impact of new information (policy changes, technology changes, shifts in consumer preferences) before deciding whether to analyze the data.
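
Relating back to item (c): as a purely illustrative aside (not from the brief), here is a brute-force computation of exact Shapley values for a toy "data provider" game. The provider names and payoff numbers are invented; the payoff function stands in for something like model accuracy achieved with each subset of data.

    from itertools import combinations
    from math import factorial

    def shapley_values(players, value):
        """Exact Shapley values by enumerating all coalitions.

        players: list of ids (e.g. data providers or data points)
        value:   function mapping a set of players to a payoff
                 (e.g. model performance achieved with that subset of data)
        """
        n = len(players)
        phi = {p: 0.0 for p in players}
        for p in players:
            others = [q for q in players if q != p]
            for r in range(len(others) + 1):
                for coalition in combinations(others, r):
                    s = frozenset(coalition)
                    weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                    phi[p] += weight * (value(s | {p}) - value(s))
        return phi

    # Toy example: three "data providers" whose pooled data is worth more together.
    payoffs = {
        frozenset(): 0.0,
        frozenset({"A"}): 1.0, frozenset({"B"}): 1.0, frozenset({"C"}): 2.0,
        frozenset({"A", "B"}): 3.0, frozenset({"A", "C"}): 4.0, frozenset({"B", "C"}): 4.0,
        frozenset({"A", "B", "C"}): 6.0,
    }
    print(shapley_values(["A", "B", "C"], lambda s: payoffs[frozenset(s)]))
    # The values sum to the payoff of the full coalition (6.0), splitting it "fairly".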

===============================

3. Discussion

Policymakers are stuck because there is no "consensus or best method for valuing data." The authors instead propose setting up a schema to classify methods and to validate them with external surveys. The schema goal is to determine: 

  1. What is being valued?
  2. Who is valuing the data?
  3. When is the valuation taking place?
  4. What is the purpose of the valuation?
-----------------------------

3.1 What is being valued?

There are a variety of things that could be referred to as ‘data.’ The possible distinctions are illustrated in the ‘data value chain,’ (see blog review of Coyle, 2020) which sets out different stages from the generation of raw data up to the decisions made using data insights generating the potential end-user value. The authors note that: "In general, the raw data is of least interest, and some of the literature goes as far as to state that raw data does not hold any value on its own...(as) even with cost-based methods, in many ways the most straightforward approach, it is almost impossible to distinguish between costs associated with raw data generation and database formation." 

[Note: OrbMB's ORBintel method illuminates the costs and we forecast developing the means to establish the value of raw data that is to be collected, in advance of the need - Ed.]

-----------------------------

3.2 Who is Valuing the Data? 

This section explores how the perspective of different stakeholders affects the methods used to value data (Data Producers, Private Sector Producers, Data Users, Data Hubs, Public Sector vs. Private Sector Valuation, Intangible Asset and Productivity).

The key point is that data serves as an intangible asset that provides firms with a productivity advantage, especially when coupled with complementary skills. Firms that capture monopoly rents from data tend to value it more highly than alternative users, leading to a divergence between private and social valuations of data.

Valuation varies significantly based on the perspectives of different stakeholders, with public sector approaches emphasizing societal value and the monopoly rent that is taxation, while private sector valuations focus on costs and impacts. Understanding these differing perspectives is crucial for accurately assessing data's overall worth.

-----------------------------

3.3: When is the Valuation Taking Place? 

This section distinguishes between ex ante (before the event) and ex post (after the event) data valuation methods. Ex ante valuations are fraught with uncertainty, and therefore with risk and cost, and so are less widely used; ex post valuations mitigate these uncertainties.

-----------------------------

3.4 What is the purpose of the valuation? 

The authors contend that the purpose influences the choice of methodologies, with different approaches suited to different goals. 

===============================

References:

(1) SNA is the United Nations (UN) framework setting “the internationally agreed standard set of recommendations on how to compile measures of economic activity.” The framework establishes consistent accounting rules and classifications, to make multistate comparison possible. The next update (2025) will include an update on data valuation. 



09 October 2024

More than Dual-Use?

We've accepted an invitation to join the Peachscore+Gust Data-driven Accelerator. This month's post was to be a review of Cambridge researcher Diane Coyle's (2022) valuation analysis; and we are juggling. So, for this month, a short post about an observation about market sectors.

The Root Taxonomy of Goods and Services - More than Dual-Use?

Goods and services generally get classed into three sector classes: 

  • Civilian
  • Military/national security
  • Dual-Use (use cases in both sectors)

This has never been exactly true, as there are numerous goods and services that are not purely civilian consumer (individual & household) offers. These include various public administration and safety segments, ranging from police and rescue to wastewater and pothole maintenance.

The better structure might be to say that there are five sector classes:

  1. Civilian;
  2. Civil Aid; and
  3. Military/National Security;
where:
  • #1, #2, and #3 are (#4) Triple-Use; and
  • #2 and #3 are (#5) Dual-Use.

Consider the carabiner: it is sold to recreational climbers (civilian), carried by search-and-rescue teams (civil aid), and issued to military units (military/national security), making it a triple-use good.


Comments?




