What is Data Labeling

Taxonomy of available data labeling technologies

Sandeep Uttamchandani

Image credit: Unsplash

Need for Data Labeling Tools

The key to ML is the availability of the “right” data: the right features and metrics, the right distribution in the raw data (independent and identically distributed, IID), and the right labels for the data samples.

The need for labeled data depends on the type of ML algorithm; supervised learning, for instance, requires labeled samples to train models. In 2020, the image/video segment accounted for over 35% of global revenue in the data collection and labeling market. Data labeling grew across all sectors, particularly in healthcare, where AI applications are increasingly used for diagnostic automation, treatment prediction, gene sequencing, drug development, and so on.

Getting labeled data samples is often expensive. Self-supervised learning is an active area of research that leverages the vast amount of unlabeled data by setting learning objectives that derive supervision from the data itself. While this is promising for generic use cases, developing models for specialized tasks still requires some labeled data samples (zero- and few-shot learning).
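To make "supervision from the data itself" concrete, here is a minimal sketch of one common self-supervised pretext task, masked-word prediction. It is an illustration only, not a method described in this article: the function name `make_masked_examples`, the `[MASK]` placeholder, and the sample sentences are all assumptions chosen for the example.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def make_masked_examples(sentences, seed=0):
    """Turn unlabeled sentences into (masked_input, target_word) pairs.

    The "label" is simply the word that was hidden, so no human
    annotation is needed to build the training set.
    """
    rng = random.Random(seed)
    examples = []
    for sentence in sentences:
        tokens = sentence.split()
        if len(tokens) < 2:
            continue  # too short to predict a word from its context
        i = rng.randrange(len(tokens))  # position of the word to hide
        target = tokens[i]
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        examples.append((" ".join(masked), target))
    return examples


# Illustrative unlabeled text (made up for this example).
unlabeled = [
    "the scan shows a small lesion",
    "labeling images is expensive",
]

pairs = make_masked_examples(unlabeled)
for masked_input, target in pairs:
    print(masked_input, "->", target)
```

A model trained on such pairs learns general structure from raw text. Adapting it to a specialized task would typically still call for a small set of human-labeled samples, which is the point made above.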

A common myth is that supervised ML requires a large amount of data. The amount of data required depends on several factors, namely…

Sandeep Uttamchandani

Sharing 20+ years of real-world exec experience leading Data, Analytics, AI & SW Products. O’Reilly book author. Founder AIForEveryone.org. #Mentor #Advise