Novice data scientists typically have the notion that all they need to do is find the best model for their data and then fit it. Nothing could be farther from the actual practice of data science. In fact, data wrangling (also called data cleansing and data munging) and exploratory data analysis often consume 80% of a data scientist’s time.
No matter how easy data wrangling and exploratory data analysis are conceptually, it can be hard to get them right. Uncleansed or badly cleansed data is garbage, and the GIGO principle (garbage in, garbage out) applies to modeling and analysis just as much as it does to any other aspect of data processing.
What is data wrangling?
Data rarely comes in usable form. It’s often contaminated with errors and omissions, rarely has the desired structure, and usually lacks context. Data wrangling is the process of discovering the data, cleaning it, validating it, structuring it for usability, enriching the content (possibly by adding information from public data such as weather and economic conditions), and in some cases aggregating and transforming the data.
Exactly what goes into data wrangling can vary. If the data comes from instruments or IoT devices, data transfer can be a major part of the process. If the data will be used for machine learning, transformations can include normalization or standardization as well as dimensionality reduction. If exploratory data analysis will be performed on personal computers with limited memory and storage, the wrangling process may include extracting subsets of the data. If the data comes from multiple sources, the field names and units of measurement may need consolidation through mapping and transformation.
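For example, if one source reports temperature in Fahrenheit under one column name and another reports Celsius under a different name, both have to be mapped onto a shared schema before the data can be combined. A minimal pandas sketch of that kind of consolidation, with made-up column names and values:

```python
import pandas as pd

# Hypothetical sources: same measurement, different field names and units.
site_a = pd.DataFrame({"timestamp": ["2021-06-01"], "temp_f": [86.0]})
site_b = pd.DataFrame({"time": ["2021-06-01"], "temperature_c": [30.0]})

# Map both sources onto shared field names.
site_a = site_a.rename(columns={"timestamp": "time", "temp_f": "temp_c"})
site_b = site_b.rename(columns={"temperature_c": "temp_c"})

# Convert the Fahrenheit readings to Celsius so the units agree.
site_a["temp_c"] = (site_a["temp_c"] - 32) * 5 / 9

combined = pd.concat([site_a, site_b], ignore_index=True)
print(combined)
```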
What is exploratory data analysis?
Exploratory data analysis is closely associated with John Tukey, of Princeton University and Bell Labs. Tukey proposed exploratory data analysis in 1961, and wrote a book about it in 1977. Tukey’s interest in exploratory data analysis influenced the development of the S statistical language at Bell Labs, which later led to S-Plus and R.
Exploratory data analysis was Tukey’s response to what he perceived as over-emphasis on statistical hypothesis testing, also called confirmatory data analysis. The difference between the two is that in exploratory data analysis you investigate the data first and use it to suggest hypotheses, rather than jumping right to hypotheses and fitting lines and curves to the data.
In practice, exploratory data analysis combines graphics and descriptive statistics. One highly cited book chapter uses R to explore the 1990s Vietnamese economy with histograms, kernel density estimates, box plots, means and standard deviations, and illustrative graphs.
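A minimal sketch of that combination in Python, assuming a CSV file with a numeric income column (the file and column names are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: any DataFrame with a numeric column works the same way.
df = pd.read_csv("households.csv")

# Descriptive statistics: count, mean, standard deviation, quartiles.
print(df["income"].describe())

# Graphics: a histogram and a box plot of the same variable.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df["income"].plot.hist(bins=30, ax=ax1, title="Histogram")
df["income"].plot.box(ax=ax2, title="Box plot")
plt.show()
```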
ETL and ELT for data analysis
In traditional database usage, ETL (extract, transform, and load) is the process for extracting data from a data source, often a transactional database, transforming it into a structure suitable for analysis, and loading it into a data warehouse. ELT (extract, load, and transform) is a more modern process in which the data goes into a data lake or data warehouse in raw form, and then the data warehouse performs any necessary transformations.
Whether you have data lakes, data warehouses, all the above, or none of the above, the ELT process is more appropriate for data analysis and specifically machine learning than the ETL process. The underlying reason for this is that machine learning often requires you to iterate on your data transformations in the service of feature engineering, which is very important to making good predictions.
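As a rough sketch of the ELT pattern, with SQLite standing in for the warehouse and purely illustrative file, table, and column names: load the raw extract untouched, then run the transformations inside the database, where they can be rewritten as often as feature engineering demands without re-extracting the source data.

```python
import sqlite3
import pandas as pd

# Extract and load the raw data as-is (the "EL" of ELT).
raw = pd.read_csv("orders_raw.csv")
conn = sqlite3.connect("warehouse.db")
raw.to_sql("orders_raw", conn, if_exists="replace", index=False)

# Transform inside the "warehouse" (the "T"); this step can be revised
# and re-run repeatedly as the feature set evolves.
conn.execute("DROP TABLE IF EXISTS orders_clean")
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT order_id,
           UPPER(country) AS country,
           quantity * unit_price AS order_value
    FROM orders_raw
    WHERE quantity > 0
""")
conn.commit()
```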
Screen scraping for data mining
There are times when your data is available in a form your analysis programs can read, either as a file or via an API. But what about when the data is only available as the output of another program, for example on a tabular website?
It’s not that hard to parse and collect web data with a program that mimics a web browser. That process is called screen scraping, web scraping, or data scraping. Screen scraping originally meant reading text data from a computer terminal screen; these days it’s much more common for the data to be displayed in HTML web pages.
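For tabular pages, pandas can do most of the work. A minimal sketch (the URL is illustrative, and pandas.read_html needs an HTML parser such as lxml or html5lib installed):

```python
import pandas as pd

# read_html fetches the page and parses every HTML <table> it finds
# into its own DataFrame.
url = "https://example.com/quarterly-results.html"
tables = pd.read_html(url)

print(f"Found {len(tables)} tables")
df = tables[0]      # pick the table you need
print(df.head())
```

For pages that render their tables with JavaScript, a browser-driving tool such as Selenium or Playwright is usually needed instead.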
Cleaning data and imputing missing values for data analysis
Most raw real-world datasets have missing or obviously wrong data values. The simple steps for cleaning your data include dropping columns and rows that have a high percentage of missing values. You might also want to remove outliers later in the process.
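A minimal pandas sketch of those dropping rules, using an arbitrary 50% cutoff for missing values:

```python
import numpy as np
import pandas as pd

# Illustrative data with scattered missing values.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

# Drop columns where more than half the values are missing...
df = df.loc[:, df.isna().mean() <= 0.5]

# ...then drop any rows that still contain a missing value.
df = df.dropna()
print(df)
```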
Sometimes if you follow those rules you lose too much of your data. An alternate way of dealing with missing values is to impute values. That essentially means guessing what they should be. This is easy to implement with standard Python libraries.
The Pandas data import functions, such as read_csv(), can replace a placeholder symbol such as ‘?’ with ‘NaN’. The Scikit-learn class SimpleImputer() can replace ‘NaN’ values using one of four strategies: column mean, column median, column mode, and constant. For a constant replacement value, the default is ‘0’ for numeric fields and ‘missing_value’ for string or object fields. You can set a fill_value to override that default.
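Putting those pieces together, a minimal sketch (the file name is illustrative):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Treat '?' placeholders as NaN while reading the file.
df = pd.read_csv("autos.csv", na_values="?")

numeric_cols = df.select_dtypes(include="number").columns

# Replace NaN in the numeric columns with each column's median;
# the other strategies are "mean", "most_frequent", and "constant".
imputer = SimpleImputer(strategy="median")
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```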
Which imputation strategy is best? It depends on your data and your model, so the only way to know is to try them all and see which strategy yields the fitted model with the best validation accuracy scores.
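One way to run that comparison is to wrap each strategy in a scikit-learn pipeline and compare cross-validated accuracy. In the sketch below, the dataset and the artificially injected missing values are purely illustrative; with real data the gaps are already there.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stock dataset with 10% of the values knocked out at random,
# just so there is something to impute.
X, y = load_breast_cancer(return_X_y=True)
X = X.copy()
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

for strategy in ["mean", "median", "most_frequent", "constant"]:
    model = make_pipeline(
        SimpleImputer(strategy=strategy),
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(model, X, y, cv=5)  # default scoring: accuracy
    print(f"{strategy:>13}: {scores.mean():.3f}")
```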
Feature engineering for predictive modeling
A feature is an individual measurable property or characteristic of a phenomenon being observed. Feature engineering is the construction of a minimum set of independent variables that explain a problem. If two variables are highly correlated, either they need to be combined into a single feature, or one should be dropped. Sometimes people perform principal component analysis (PCA) to convert correlated variables into a set of linearly uncorrelated variables.
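A sketch of both options, assuming a hypothetical table of numeric features in features.csv:

```python
import pandas as pd
from sklearn.decomposition import PCA

X = pd.read_csv("features.csv")

# Inspect pairwise correlations to spot redundant variables.
print(X.corr().abs())

# Or let PCA replace correlated columns with uncorrelated components
# that keep 95% of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X)
print(X_reduced.shape)
```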
Categorical variables, usually in text form, must be encoded into numbers to be useful for machine learning. Assigning an integer for each category (label encoding) seems obvious and easy, but unfortunately some machine learning models mistake the integers for ordinals. A popular alternative is one-hot encoding, in which each category is assigned to a column (or dimension of a vector) that is either coded 1 or 0.
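A minimal one-hot encoding sketch with pandas (scikit-learn’s OneHotEncoder does the same job inside a modeling pipeline):

```python
import pandas as pd

# Illustrative categorical column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One column per category, coded 1 or 0.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```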
Feature generation is the process of constructing new features from the raw observations. For example, subtract Year_of_Birth from Year_of_Death and you construct Age_at_Death, which is a prime independent variable for lifetime and mortality analysis. The Deep Feature Synthesis algorithm is useful for automating feature generation; you can find it implemented in the open source Featuretools framework.
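The Age_at_Death example translates directly into a couple of lines of pandas (the values below are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Year_of_Birth": [1902, 1918, 1935],
    "Year_of_Death": [1985, 2001, 1999],
})

# Construct the new feature from the raw observations.
df["Age_at_Death"] = df["Year_of_Death"] - df["Year_of_Birth"]
print(df)
```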
Feature selection is the process of eliminating unnecessary features from the analysis, to avoid the “curse of dimensionality” and overfitting of the data. Dimensionality reduction algorithms can do this automatically. Techniques include removing variables with many missing values, removing variables with low variance, using decision tree or random forest feature importances, removing or combining variables with high correlation, backward feature elimination, forward feature selection, factor analysis, and PCA.
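As a sketch of two of those techniques, removing low-variance variables and backward feature elimination, applied here to a stock dataset with scikit-learn (the threshold and feature count are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, VarianceThreshold

X, y = load_breast_cancer(return_X_y=True)

# Remove near-constant variables (low variance).
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# Backward feature elimination: recursively drop the weakest variables,
# judged by random forest feature importances, until 10 remain.
selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
               n_features_to_select=10)
X_selected = selector.fit_transform(X_var, y)

print(X.shape, X_var.shape, X_selected.shape)
```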
Data normalization for machine learning
To use numeric data for machine regression, you usually need to normalize the data. Otherwise, the numbers with larger ranges might tend to dominate the Euclidian distance between feature vectors, their effects could be magnified at the expense of the other fields, and the steepest descent optimization might have difficulty converging. There are several ways to normalize and standardize data for machine learning, including min-max normalization, mean normalization, standardization, and scaling to unit length. This process is often called feature scaling.
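A minimal sketch of two of those scalers with scikit-learn (the numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales.
X = np.array([[1000.0, 0.5],
              [2000.0, 0.3],
              [3000.0, 0.9]])

# Min-max normalization rescales each column to the [0, 1] range.
print(MinMaxScaler().fit_transform(X))

# Standardization rescales each column to zero mean and unit variance.
print(StandardScaler().fit_transform(X))
```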
Data analysis lifecycle
While there are probably as many variations on the data analysis lifecycle as there are analysts, one reasonable formulation breaks it down into seven or eight steps, depending on how you want to count:
- Identify the questions to be answered for business understanding and the variables that need to be predicted.
- Acquire the data (also called data mining).
- Clean the data and account for missing data, either by discarding rows or imputing values.
- Explore the data.
- Perform feature engineering.
- Perform predictive modeling, including machine learning, validation, and statistical methods and tests.
- Visualize the data.
- Return to step one (business understanding) and continue the cycle.
Steps two and three are often considered data wrangling, but it’s important to establish the context for data wrangling by identifying the business questions to be answered (step one). It’s also important to do your exploratory data analysis (step four) before modeling, to avoid introducing biases in your predictions. It’s common to iterate on steps five through seven to find the best model and set of features.
And yes, the lifecycle almost always restarts when you think you’re done, either because the conditions change, the data drifts, or the business needs to answer additional questions.
Copyright © 2021 IDG Communications, Inc.