The Data Concepts

Aug 22, 2022

Topics Covered: Data, Data Vs Information, DIKW Pyramid, Different Aspects of Data (Formats, Scope, Biases), Structured, Semi-structured and Unstructured Data, Data Usage (Scientific Research, Business Management, Finance, Governance), Data Analysis

DATA

Data is the backbone of data-driven AI, so let’s first understand what data is.

Data is raw fact without any context: a number, symbol, character, word, code, graph, and so on.

The word ‘data’ originated as the plural form of the Latin word ‘datum’, which means ‘a given fact’.

Broadly speaking, data can be any information in digital form, or the output of a sensing device or organ.

The terms data and information are often used interchangeably, which is not correct. We will cover the difference in an upcoming section.

Data, information, knowledge, and wisdom are closely related concepts, but each term has its own meaning and its own role in relation to the others. We will also touch on this soon.

References:

https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf
https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/2-identify-need-data-solutions
https://en.wikipedia.org/wiki/Data

DATUM, DATA AND DATASET

Mostly we talk about data, but occasionally you may hear terms like datum or dataset, so let's understand the difference. A datum is a single piece of information, which can be treated as one observation. Data is the plural of datum, that is, multiple observations. A dataset is a homogeneous collection of data (each datum must have the same focus).
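The distinction can be sketched in a few lines of Python. The temperature readings, the field names, and the is_homogeneous helper below are all hypothetical, invented only for illustration; "same focus" is approximated here as "every record has the same fields", which is one possible reading and not a formal definition.

```python
# Hypothetical readings, invented for illustration.
datum = 21.5                 # a datum: one observation
data = [21.5, 22.1, 19.8]    # data: several observations

# A dataset: a collection in which every record has the same focus
# (here, a temperature reading for a city).
dataset = [
    {"city": "Pune", "temp_c": 21.5},
    {"city": "Delhi", "temp_c": 22.1},
    {"city": "Leh", "temp_c": 19.8},
]


def is_homogeneous(records):
    """Return True when every record has exactly the same set of fields."""
    field_sets = [set(record) for record in records]
    return all(fields == field_sets[0] for fields in field_sets)


print(is_homogeneous(dataset))
```

Adding a record with different fields, for example {"name": "x"}, would make the check fail, which is a rough signal that the collection no longer has a single focus.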

References:

https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

INFORMATION

When data is processed and put into context, it becomes information, which humans can use in meaningful ways, e.g. making decisions or forecasting.

References:

https://en.wikipedia.org/wiki/Information

KNOWLEDGE AND WISDOM

When we put relevant information to work in a specific domain, it becomes knowledge. And when that knowledge is enhanced with first-hand experience, it becomes wisdom.

Let’s relate it to an example:

‘100’ is data
‘100 miles’ is information
‘100 miles is quite a long distance’ is knowledge
‘100 miles is very difficult to walk’ is wisdom

References:

https://en.wikipedia.org/wiki/DIKW_pyramid

DIFFERENT ASPECTS OF DATA

TYPES OF DATA

When working with data, it is important to know its type in order to process it correctly and get the right results. There are two types of data: qualitative and quantitative.

Qualitative data can further be classified into nominal (categories with no natural order) and ordinal (categories with a natural order), while quantitative data can be classified into discrete (countable values) and continuous (measured values that can fall anywhere in a range).
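The type of data determines which summaries make sense. The following is a minimal sketch using Python's standard statistics module; the survey values and variable names are hypothetical, and the pairing of each type with one summary is a common convention rather than a strict rule.

```python
from statistics import mean, mode

# Hypothetical survey values, invented for illustration.
blood_group = ["A", "O", "B", "O", "AB"]                  # nominal
satisfaction = ["low", "high", "medium", "high", "low"]   # ordinal
children = [0, 2, 1, 3, 2]                                # discrete
height_cm = [162.5, 171.0, 158.2, 180.4, 169.9]           # continuous

# Nominal: only counting is meaningful, so report the most frequent label.
most_common_group = mode(blood_group)

# Ordinal: the order is meaningful, so a median (middle value) makes sense.
order = {"low": 0, "medium": 1, "high": 2}
ranked = sorted(satisfaction, key=order.get)
median_satisfaction = ranked[len(ranked) // 2]

# Discrete and continuous: arithmetic is meaningful, so a mean makes sense.
mean_children = mean(children)
mean_height = mean(height_cm)

print(most_common_group, median_satisfaction, mean_children, mean_height)
```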

FORMATS OF DATA

We can classify data formats into three categories: structured, semi-structured, and unstructured.

Structured data has a definite structure, such as a table with rows and columns.
Semi-structured data has some structure, such as JSON documents, key-value pairs, or a graph database.
Unstructured data has no specific structure; examples are photos, audio, and video files.
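The three formats can be illustrated with Python's standard csv and json modules. This is only a sketch: the sample text, names, and field values are made up, and the four bytes stand in for the start of an image file rather than a real photo.

```python
import csv
import io
import json

# Structured: a table, every row has the same columns.
structured_text = "id,name,age\n1,Asha,34\n2,Ravi,29\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: records carry their own keys, which may differ per record.
semi_structured_text = (
    '[{"id": 1, "name": "Asha", "tags": ["a", "b"]},'
    ' {"id": 2, "name": "Ravi"}]'
)
records = json.loads(semi_structured_text)

# Unstructured: just bytes; any meaning must be extracted by other means.
# (These are the first four bytes of the PNG file signature.)
unstructured_bytes = bytes([0x89, 0x50, 0x4E, 0x47])

same_columns = all(set(row) == set(rows[0]) for row in rows)
same_keys = all(set(rec) == set(records[0]) for rec in records)

print(same_columns, same_keys, len(unstructured_bytes))
```

Note that csv.DictReader returns every value as a string, so the age column would still need converting before any arithmetic.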

References:

https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/2-identify-need-data-solutions
https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

SCOPE OF DATA

Data can be classified into two categories based on scope:

Population, which means we have access to all the data
Sample, which means only a portion is available or feasible

In most cases we don’t have access to all the data. In those cases, we collect a sample in a way that preserves most of the information in the population, so that we can estimate patterns in the population from that sample.
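As a rough illustration of estimating a population pattern from a sample, the sketch below builds a synthetic population and compares its mean with the mean of a simple random sample. The population (simulated heights), the sample size, and the seed are all assumptions chosen for the example, and simple random sampling is only one of several sampling schemes.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic population: 100,000 simulated heights (cm), invented for illustration.
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Simple random sample without replacement: every member is equally likely.
sample = rng.sample(population, 1_000)

population_mean = mean(population)
sample_mean = mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a random sample of this size, the two means should be close, which is what allows the sample to stand in for the population.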

References:

https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

BIASES IN DATA

Bias in data means over- or under-representation of a sub-population, which may not be intentional.
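One simple way to spot over- or under-representation is to compare each group's share in the sample with its share in the population. The sketch below assumes the population shares are known, which is often not the case in practice; the group labels, counts, and the shares and representation_gap helpers are hypothetical.

```python
from collections import Counter


def shares(labels):
    """Return each label's fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def representation_gap(population, sample):
    """Sample share minus population share, per group.

    Positive means over-represented, negative means under-represented.
    """
    pop_shares = shares(population)
    sample_shares = shares(sample)
    return {
        group: sample_shares.get(group, 0.0) - pop_share
        for group, pop_share in pop_shares.items()
    }


# Hypothetical group labels, invented for illustration.
population = ["urban"] * 60 + ["rural"] * 40
sample = ["urban"] * 45 + ["rural"] * 5

gap = representation_gap(population, sample)
print(gap)
```

Here the rural group makes up 40% of the population but only 10% of the sample, so it is under-represented.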

These are the types of biases that exist in data:

Omission: using arguments from only one side
Source selection: including more authoritative sources from one side
Story selection: sharing stories that agree with one side
Placement: unimportant stories get prominent placement on reputable media platforms
Labeling: applying labels to one side while leaving the other side unlabeled
Spin: stories providing only one interpretation of an event

References:

https://harvard-iacs.github.io/2020-CS109A/lectures/lecture02/slides/Lecture02_Data.pdf

DATA USAGE

Data is used in the following fields:

Scientific Research: Factual data is both an essential resource and a valuable output
Business Management: Data helps understand and improve processes
Finance: Whoever has the best and the fastest information gains the edge
Governance: Open data platforms help promote data-driven governance

References:

https://en.wikipedia.org/wiki/Data

DATA ANALYSIS

Data analysis is a process for obtaining raw data and converting it into information useful for decision-making by users.

Data is collected and analyzed to answer questions, test hypotheses, or disprove theories.

These are the steps of a typical data analysis process:

Data requirements: to understand what input would be required for analysis
Data collection: to collect those inputs from various sources
Data processing: to process or organize data for analysis
Data cleaning: to deal with incomplete, inaccurate, or redundant elements
Exploratory data analysis: to explore data and understand the patterns
Data product: to convert data into actionable inputs
Communication: to convey the results of the analysis to users
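The steps above can be sketched as a tiny pipeline using only the Python standard library. Everything here is hypothetical: the order data, the column names, and the function names were invented for illustration, and a real analysis would involve far more than this at each step.

```python
import csv
import io
from statistics import mean

# Data requirements: we decide we need an order id and an amount per order.
# Hypothetical raw input with a missing value, a duplicate, and a bad value.
RAW = "order_id,amount\n1,120.0\n2,\n3,80.5\n3,80.5\n4,abc\n"


def collect(text):
    """Data collection and processing: parse raw text into row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Data cleaning: drop incomplete, inaccurate, and redundant rows."""
    seen = set()
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # missing or non-numeric amount
        if row["order_id"] in seen:
            continue  # duplicate order
        seen.add(row["order_id"])
        cleaned.append({"order_id": int(row["order_id"]), "amount": amount})
    return cleaned


def explore(rows):
    """Exploratory data analysis: a few summary statistics."""
    amounts = [row["amount"] for row in rows]
    return {"count": len(amounts), "mean": mean(amounts), "max": max(amounts)}


def communicate(summary):
    """Communication: turn the summary into a sentence for users."""
    return (
        f"{summary['count']} valid orders, "
        f"average amount {summary['mean']:.2f}, "
        f"largest {summary['max']:.2f}"
    )


report = communicate(explore(clean(collect(RAW))))
print(report)
```

Keeping each step as its own function makes it easy to rerun or replace one step, for example swapping the cleaning rules, without touching the rest.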

References:

https://en.wikipedia.org/wiki/Data_analysis

Ankit Rathi is a Cloud Data Technologist, published author & well-known speaker. His interest lies primarily in building end-to-end data/AI applications/products following best practices of Data Engineering and Architecture.
