One article to help you understand feature engineering!

Posted by muschett at 2020-04-10

By Bhalchandra Madhekar

Translation: Chen Zhiyan

Checked by: Zhang Ling

This article is about 1,800 words; the suggested reading time is 7 minutes.

This article describes a typical machine learning pipeline based on the Cross-Industry Standard Process for Data Mining (CRISP-DM), the standard process model of the data mining industry.

Regardless of scale, data has become a first-class asset of modern enterprises, companies and organizations. Any intelligent system, however complex, needs to be data-driven. At the core of every intelligent system are one or more algorithms that learn from data, using machine learning, deep learning or statistical methods. They consume this data to build knowledge and provide intelligent insights over a period of time.

These algorithms are quite general, but they cannot work effectively on raw data. We therefore need to extract meaningful features from the raw data so that the algorithms can understand and use it.

Any intelligent data insight system is essentially an end-to-end pipeline:

First, get the original data;

Then we use data processing techniques to wrangle, process and extract meaningful features and attributes from the data;

Finally, these features are usually modeled with a statistical model or a machine learning model.

If necessary, depending on the problem at hand, the model is also deployed for future use.
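The four stages above can be sketched end to end in a few lines. This is only a minimal illustration, not the author's pipeline: the event records, the two features (mean and range of the readings) and the toy nearest-centroid model are all assumptions made up for this example.

```python
from statistics import mean

# Step 1: raw data (hypothetical IoT-style events, invented for illustration).
raw_events = [
    {"device": "a", "readings": [20.1, 20.4, 19.8], "label": "normal"},
    {"device": "b", "readings": [20.0, 35.2, 48.9], "label": "abnormal"},
    {"device": "c", "readings": [21.0, 20.7, 21.3], "label": "normal"},
    {"device": "d", "readings": [19.5, 41.0, 55.5], "label": "abnormal"},
]


def extract_features(event):
    """Step 2: turn one raw record into a numeric feature vector."""
    readings = event["readings"]
    return [mean(readings), max(readings) - min(readings)]


def fit_centroids(events):
    """Step 3: a toy nearest-centroid model, one centroid per label."""
    by_label = {}
    for event in events:
        by_label.setdefault(event["label"], []).append(extract_features(event))
    return {
        label: [mean(column) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def predict(centroids, event):
    """Step 4: the 'deployed' model scores new raw events."""
    features = extract_features(event)

    def squared_distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


centroids = fit_centroids(raw_events)
new_event = {"device": "e", "readings": [20.2, 20.3, 20.1]}
prediction = predict(centroids, new_event)
```

Note that the model never sees the raw readings directly. Both training and prediction go through `extract_features`, which is the feature engineering step the rest of this article is about.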

Once the raw data is in hand, building a model directly on it is reckless: raw data rarely yields the desired results or performance, and the algorithms will not automatically extract meaningful features from it. In the data preparation stage shown in the figure above, once the raw data has been cleaned and preprocessed, a variety of methods can be used to extract meaningful attributes or features from it. Feature engineering is as much an art as a science, which is why data scientists usually spend about 70% of their time on data preparation before modeling.

"Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data."

-Dr. Jason Brownlee

This makes clear that feature engineering is the process of transforming data into features that serve as input to machine learning models. In other words, high-quality features help improve the overall performance and accuracy of the model. Features also depend heavily on the underlying problem.

Therefore, even when the machine learning task is the same across scenarios, such as classifying Internet of Things (IoT) events as normal or abnormal behavior, or classifying customer sentiment, the features extracted in each scenario will be very different.

What are features?

A feature is usually a specific representation built on top of the raw data. It is an individual measurable attribute, typically represented by a column in the dataset. In a generic two-dimensional dataset, each observation is represented by a row, each feature by a column, and each observation has a specific value for every feature.

Therefore, as in the example above, each row typically represents a feature vector, and the features of all observations together form a two-dimensional feature matrix, also known as a feature set. This is similar to a data frame or spreadsheet used to represent two-dimensional data. Machine learning algorithms usually work with these numerical matrices or tensors, so most feature engineering techniques are about transforming raw data into some numerical representation that the algorithms can understand.
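The row/column layout described above can be made concrete with a small sketch. The feature names and values below are hypothetical and chosen only for illustration. Plain nested lists are used so that the example needs no third-party library.

```python
# Hypothetical feature names and values, invented for illustration.
feature_names = ["age", "income", "num_orders"]

# Each inner list is one observation's feature vector (one row).
feature_matrix = [
    [34, 52000.0, 3],
    [27, 48000.0, 7],
    [45, 91000.0, 1],
]

n_observations = len(feature_matrix)   # number of rows
n_features = len(feature_matrix[0])    # number of columns

# One row = the feature vector of a single observation.
second_observation = feature_matrix[1]

# One column = a single feature across all observations.
income_index = feature_names.index("income")
income_column = [row[income_index] for row in feature_matrix]
```

In practice a data frame or array library would typically hold this matrix, but the shape is the same: observations along one axis, features along the other.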

Features derived from a dataset fall into two categories:

Inherent raw features are obtained directly from the dataset, with no additional data manipulation.

Derived features are usually obtained through feature engineering, by extracting them from existing data attributes.

For example, from an order dataset containing the order date, you can create a new order-age feature (the time elapsed since the order) by subtracting the order date from the current date. In certain deep learning algorithms, on the other hand, features are usually relatively simple, because the algorithm transforms the data internally. This approach requires a large amount of data and comes at the expense of interpretability. However, in the case of image processing or natural language processing, such a trade-off is often worthwhile.

For most other use cases that companies face, such as predictive analytics, feature engineering is what transforms data into the format that machine learning requires. The choice of features is crucial to both the interpretability and the performance of the model. Without feature engineering, today's large companies could not deploy accurate machine learning systems.
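The order-date example above can be sketched as follows. The order records and the field name `order_age_days` are assumptions made for illustration. A fixed reference date stands in for "the current date" so that the result is reproducible.

```python
from datetime import date

# Hypothetical order records containing only an inherent raw feature.
orders = [
    {"order_id": 1, "order_date": date(2020, 3, 1)},
    {"order_id": 2, "order_date": date(2020, 4, 5)},
]


def add_order_age(rows, today):
    """Derive a new feature: days elapsed between the order date and `today`."""
    return [
        {**row, "order_age_days": (today - row["order_date"]).days}
        for row in rows
    ]


# A fixed reference date replaces date.today() to keep the sketch deterministic.
enriched = add_order_age(orders, today=date(2020, 4, 10))
```

Subtracting two dates yields a duration rather than a date, so the derived feature is numeric and can be placed directly into a feature matrix.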

Feature Engineering

Numerical data usually describes observations, records or measurements in the form of scalar values. Here, numerical data means continuous data, as opposed to the discrete data typically used to represent categorical data. Numerical data can also be vector-valued, where each value or entity in the vector can represent a specific feature. Integers and floating-point numbers are the most common and widely used data types for continuous numerical data.

Even though numerical data can be fed directly into a machine learning model, you still need to design features relevant to the scenario, problem and domain before building the model. The need for feature engineering therefore remains.
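As a hedged illustration of that point, the sketch below starts from numeric columns that a model could already consume and derives features that may be more meaningful for a given problem. The column names and the three transformations (a ratio, a log transform and a binary flag) are assumptions chosen for this example, not techniques prescribed by the article.

```python
import math

# Hypothetical raw numeric data, invented for illustration.
raw_rows = [
    {"total_spend": 120.0, "num_orders": 4, "page_views": 0},
    {"total_spend": 990.0, "num_orders": 3, "page_views": 999},
]


def engineer_numeric_features(row):
    """Derive problem-specific features from already-numeric columns."""
    orders = row["num_orders"]
    return {
        # Ratio feature: average value per order (0.0 when there are no orders).
        "avg_order_value": row["total_spend"] / orders if orders else 0.0,
        # Log transform compresses heavily skewed counts; log1p handles zero.
        "log_page_views": math.log1p(row["page_views"]),
        # Binary flag: whether any page views were recorded at all.
        "has_page_views": int(row["page_views"] > 0),
    }


engineered = [engineer_numeric_features(row) for row in raw_rows]
```

Which of these derived features is actually useful depends on the scenario and domain, which is why the design step cannot be skipped even for purely numeric inputs.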

Original title: Feature Engineering

Original link:
