06 December 2023

And what of the waste?



 
We can discuss tangible vs. intangible, but what about the waste, especially the intangible waste? We can touch everything that data touches, but not data itself, which leads to serious issues when preparing data for service. It is like managing fog.

Consider "80%" - the reported time cost to clean data.

Data has no current value, so how can we say we know the cost to clean it? Is the 80% rule-of-thumb accurate, hyperbole, or, as UK data scientist Leigh Dodds puts it, “bullshit stats”?

The rule-of-thumb appears to originate with a citation error in a 2018 Harvard Business Review article:

  • “Yet today, most data fails to meet basic “data are right” standards. Reasons range from data creators not understanding what is expected, to poorly calibrated measurement gear, to overly complex processes, to human error. To compensate, data scientists cleanse the data before training the predictive model. It is time-consuming, tedious work (taking up to 80% of data scientists’ time), and it’s the problem data scientists complain about most.” Thomas C. Redman, If Your Data Is Bad, Your Machine Learning Tools Are Useless, April 02, 2018: https://hbr.org/2018/04/if-your-data-is-bad-your-machine-learning-tools-are-useless
  • This is cited in "AI starts with data", AI Business eBook Series in collaboration with Telus International, 2021: https://resources.aibusiness.com/ai-starts-with-data/
  • Redman errs in citing Edd Wilder-James who states that "80% of the work is acquiring and preparing data"; Edd Wilder-James, Breaking Down Data Silos, December 05, 2016: https://hbr.org/2016/12/breaking-down-data-silos?autocomplete=true.
  • And Wilder-James cites Gil Press who states: "Data preparation accounts for about 80% of the work of data scientists", breaking this into six tasks that data scientists spend most of their time doing (again, not 100%!), with “Cleaning and organizing data: 60%”: Gil Press, Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says; Mar 23, 2016: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=1104aab46f63
 
 Does anyone really know? Here are notes assembled from reports:
 



