*This is the same article I published on Medium.
More and more engineers are facing a growing number of projects related to artificial intelligence (AI) and machine learning (ML). Many AI/ML projects appear to be algorithm-oriented, but every step in developing an AI/ML product is centered on data. Engineers on AI/ML projects therefore need to understand how data is created and used. This article may help junior and mid-level engineers or data scientists understand the data that is prepared for AI/ML products.
The definitions below are cited from an online dictionary because it is updated faster than printed material. Data is based on information, so the definition of information is given first.
Information is stimuli that have meaning in some context for its receiver. When information is entered into and stored in a computer, it is generally referred to as data. After processing (such as formatting and printing), output data can again be perceived as information.
The definition of data is as follows.
In computing, data is information that has been translated into a form that is efficient for movement or processing. Relative to today’s computers and transmission media, data is information converted into binary digital form. It is acceptable for data to be used as a singular subject or a plural subject. Raw data is a term used to describe data in its most basic digital format.
The two definitions can be summarized in a picture, as shown in figure 1. A receiver, who is a human, perceives information in the surrounding context, measures it, and converts it into a qualitative or quantitative form so that its meaning is easy to recognize. Once the information is in a form that a computer can easily handle, it is processed, new information carrying new insight is created, and that information is passed back to a receiver.
The types of measurement used to translate information into data are shown as the gray boxes in figure 2.
Categorical data is qualitative data and consists of nominal data and ordinal data. Before such data can be passed to code and used in analyses, it has to be transformed into numbers, such as binary values or arbitrary numeric labels.
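As a minimal sketch of that transformation, the snippet below maps nominal values to binary (one-hot) vectors and ordinal values to ordered integer labels using plain Python. The `colors` and `sizes` lists are hypothetical examples, not data from this article.

```python
# Hypothetical example data (not from the article).
colors = ["red", "green", "blue", "green"]      # nominal: no inherent order
sizes = ["small", "large", "medium", "small"]   # ordinal: ordered categories

# Nominal -> binary (one-hot) vectors; the column order is alphabetical.
color_categories = sorted(set(colors))          # ['blue', 'green', 'red']
one_hot = [[1 if c == cat else 0 for cat in color_categories] for c in colors]

# Ordinal -> integer labels that preserve the order small < medium < large.
size_order = {"small": 0, "medium": 1, "large": 2}
size_labels = [size_order[s] for s in sizes]

print(one_hot)      # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
print(size_labels)  # [0, 2, 1, 0]
```

One-hot vectors avoid implying an order between nominal categories, whereas integer labels are a natural fit when the order itself carries meaning.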
Numerical data, on the other hand, is quantitative data composed of discrete and/or continuous numbers. Discrete numbers are countable values, such as the number of students, and can be obtained by counting occurrences of nominal or ordinal data. Continuous numbers can be divided into two types: interval scale and ratio scale. The difference between them is whether the data has a "true zero"; as a rule of thumb, ratio-scale data has a true zero and takes no negative values. Strictly speaking, the scale of a continuous variable is not decided by whether negative values occur, but this rule of thumb makes the distinction easy to understand.
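The "true zero" distinction can be illustrated with temperature, a common textbook example that does not come from this article: Celsius is an interval scale, while Kelvin is a ratio scale. The sketch below suggests why differences are meaningful on both scales but ratios only on the ratio scale. The function name `celsius_to_kelvin` is chosen here for illustration.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return celsius + 273.15

c_low, c_high = 10.0, 20.0
k_low, k_high = celsius_to_kelvin(c_low), celsius_to_kelvin(c_high)

# Differences agree on both scales: a gap of 10 degrees either way.
diff_celsius = c_high - c_low
diff_kelvin = k_high - k_low

# Ratios do not agree: 20 degrees Celsius is not "twice as hot" as 10.
ratio_celsius = c_high / c_low   # 2.0, but not physically meaningful
ratio_kelvin = k_high / k_low    # roughly 1.035, the meaningful ratio

print(diff_celsius, round(diff_kelvin, 2))
print(ratio_celsius, round(ratio_kelvin, 3))
```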
Data can be described in three ways: 1) data structure, 2) data type, and 3) data format (= file format). This section briefly summarizes them using commonly used Python syntax; rarely used syntax is omitted.
Lists and Dictionaries
>>> tel = {'jack': 4098, 'sape': 4139}
>>> tel['guido'] = 4127
>>> tel
{'jack': 4098, 'sape': 4139, 'guido': 4127}
>>> tel['jack']
4098
>>> del tel['sape']
>>> tel['irv'] = 4127
>>> tel
{'jack': 4098, 'guido': 4127, 'irv': 4127}
>>> list(tel)
['jack', 'guido', 'irv']
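The session above only exercises a dictionary. As a small complementary sketch, with hypothetical values chosen purely for illustration, a list is an ordered, mutable sequence that supports indexing and slicing:

```python
# Hypothetical values for illustration only.
squares = [1, 4, 9, 16]

squares.append(25)        # lists are mutable: add an item at the end
first = squares[0]        # index from the start
last = squares[-1]        # a negative index counts from the end
middle = squares[1:3]     # slicing returns a new list

print(squares)            # [1, 4, 9, 16, 25]
print(first, last)        # 1 25
print(middle)             # [4, 9]
```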
Sequences and Tuples
Sequences
>>> t = 12345, 54321, 'hello!'
>>> t[0]
12345
Tuples
>>> t
(12345, 54321, 'hello!')
>>> # Tuples may be nested:
... u = t, (1, 2, 3, 4, 5)
>>> u
((12345, 54321, 'hello!'), (1, 2, 3, 4, 5))
>>> # Tuples are immutable:
... t[0] = 88888
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> # but they can contain mutable objects:
... v = ([1, 2, 3], [3, 2, 1])
>>> v
([1, 2, 3], [3, 2, 1])
Set
>>> basket = {'apple', 'orange', 'apple', 'pear', 'orange', 'banana'}
>>> print(basket)  # show that duplicates have been removed
{'orange', 'banana', 'pear', 'apple'}
>>> 'orange' in basket  # fast membership testing
True
>>> 'crabgrass' in basket
False
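The sessions above cover data structures. For data type, the second of the three ways to describe data, a brief sketch with hypothetical values shows how Python's built-in `type()` reports the type of a value and how conversion between types is written explicitly:

```python
# Hypothetical values; each built-in type is inspected with type().
values = [42, 3.14, "hello", True, None]

type_names = [type(v).__name__ for v in values]
print(type_names)  # ['int', 'float', 'str', 'bool', 'NoneType']

# Converting between types is explicit.
count = int("7")          # str -> int
ratio = float(count)      # int -> float
label = str(ratio)        # float -> str
print(count, ratio, label)  # 7 7.0 7.0
```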
In the case of AI/ML projects, the workflow can be described as shown in figure 3.
First, the input data is collected and formatted according to the measurement types described in the section above. The input data can come from an RDBMS, CSV, JSON, Excel, HTML, text, images, etc., as mentioned in the previous section.
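To make two of these formats concrete, the sketch below reads the same hypothetical records from CSV and from JSON using only the Python standard library. The in-memory strings `csv_text` and `json_text` stand in for real files and are assumptions made for this example.

```python
import csv
import io
import json

# Hypothetical in-memory content standing in for files on disk.
csv_text = "name,age\nalice,30\nbob,25\n"
json_text = '[{"name": "alice", "age": 30}, {"name": "bob", "age": 25}]'

# CSV values always arrive as strings, so numeric columns need conversion.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in csv_rows:
    row["age"] = int(row["age"])

# JSON keeps numbers as numbers.
json_rows = json.loads(json_text)

print(csv_rows == json_rows)  # True
```

The same information ends up in the same data structure (a list of dictionaries), but the format determines how much type conversion is needed on the way in.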
After that, the input data is imported into the machine learning API. The machine learning API usually consists of three pieces of code: code for accessing the data, code for pre-processing the data, and the machine learning algorithms. Depending on the algorithm, the data structure and data type need to be manipulated and processed differently before the data is passed into the machine learning models.
Finally, the input data, which does not clearly convey valuable meaning to humans, is transformed into valuable information by passing through the machine learning API shown in figure 3. Humans then receive it as useful information.
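One possible reading of those three pieces of code is sketched below in plain Python. The data, the function names, and the nearest-centroid classifier are all assumptions chosen for brevity; they are a stand-in for a real algorithm, not the method used in any particular project.

```python
import csv
import io

# Hypothetical data: two numeric features and a class label.
RAW_CSV = """height,weight,label
150,50,small
155,55,small
180,85,large
185,90,large
"""

def load_rows(text):
    """Step 1: access the data (here, parse CSV text into dictionaries)."""
    return list(csv.DictReader(io.StringIO(text)))

def preprocess(rows):
    """Step 2: convert strings to floats and min-max scale each feature."""
    features = [[float(r["height"]), float(r["weight"])] for r in rows]
    labels = [r["label"] for r in rows]
    lows = [min(col) for col in zip(*features)]
    highs = [max(col) for col in zip(*features)]
    scaled = [
        [(x - lo) / (hi - lo) for x, lo, hi in zip(row, lows, highs)]
        for row in features
    ]
    return scaled, labels, lows, highs

def fit_centroids(features, labels):
    """Step 3a: 'train' by averaging the feature vectors of each class."""
    grouped = {}
    for row, label in zip(features, labels):
        grouped.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(col) for col in zip(*group)]
        for label, group in grouped.items()
    }

def predict(centroids, point):
    """Step 3b: pick the class whose centroid is closest to the point."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(centroids[label], point))

rows = load_rows(RAW_CSV)
features, labels, lows, highs = preprocess(rows)
centroids = fit_centroids(features, labels)

# Scale a new sample with the same min/max values before predicting.
new_sample = [178.0, 80.0]
scaled_sample = [(x - lo) / (hi - lo) for x, lo, hi in zip(new_sample, lows, highs)]
print(predict(centroids, scaled_sample))  # large
```

Note how the data changes shape along the way: CSV text becomes a list of dictionaries of strings, then lists of floats, and only then reaches the model.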
What is information? - Definition from WhatIs.com. searchsqlserver.techtarget.com
What is data? - Definition from WhatIs.com. searchdatamanagement.techtarget.com