Data Exploration for ML


Description:

Categorical Problems

Null or Missing Values

Definition: The column contains missing values.

    *Test:*
            ?

    *Solutions:*
            1. Create a missing-value indicator column that flags the rows where the value was originally missing
            2. Create a new value, "null", and replace missing values with it
            3. Impute missing values with the most frequently occurring value in the column
            4. If the column contains too many missing values, drop it entirely
            5. Bin the column into "Missing" vs. "Non-Null"
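
A minimal pandas sketch of a possible test and of solutions 1 to 3 above. The DataFrame and the column name `color` are made up for illustration, and counting nulls with `isna()` is only one way the test could be done.

```python
import pandas as pd

# Hypothetical toy data; "color" stands in for any categorical column.
df = pd.DataFrame({"color": ["red", None, "blue", "red", None]})

# Possible test: count and share of missing values in the column.
n_missing = df["color"].isna().sum()
share_missing = df["color"].isna().mean()

# Solution 1: indicator column recording which rows were originally missing.
df["color_was_missing"] = df["color"].isna()

# Solution 2: replace missing values with an explicit "null" category.
df["color_null"] = df["color"].fillna("null")

# Solution 3: impute the most frequent value (the mode).
most_frequent = df["color"].mode().iloc[0]
df["color_mode"] = df["color"].fillna(most_frequent)
```

The share of missing values can also inform solution 4: what counts as "too many" is a judgment call that depends on the dataset.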

High Cardinality

Definition: The categorical column has too many distinct values. When the column is converted to dummy variables, each value gets its own column, which generates too many features for the model.

    *Test:*
            ?

    *Solutions:* 
            1. Group infrequent values into a single "Other" category
            2. Replace each value with how often it occurs in the column (frequency encoding)
            3. If the column has nearly as many distinct values as rows (an ID, for example), drop it
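
A rough sketch of how cardinality might be checked, together with one common remedy: lumping rare values into a single category. The column name `city`, the `min_count` threshold, and the "Other" label are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical toy data; "city" stands in for a high-cardinality column.
df = pd.DataFrame({"city": ["NYC", "NYC", "NYC", "LA", "LA", "Austin", "Boise"]})

# Possible test: number of distinct values, and how often each occurs.
n_unique = df["city"].nunique()
counts = df["city"].value_counts()

# One possible remedy: keep values seen at least `min_count` times
# (an arbitrary threshold for this sketch) and lump the rest into "Other".
min_count = 2
frequent = counts[counts >= min_count].index
df["city_grouped"] = df["city"].where(df["city"].isin(frequent), "Other")
```

Comparing `nunique()` before and after grouping shows how many dummy columns the grouping would save.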

Dummy Variables Needed

Definition: Categorical values need to be converted to numeric values so that the modeling program can work with them.

    *Test:*
            ?

    *Solutions:* 
            1. Convert categorical values to dummy variables (one 0/1 column per distinct value)
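
As an illustration, pandas can list the columns that are still non-numeric and convert one of them with `pd.get_dummies`. The columns `size` and `price` are invented for this sketch.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column.
df = pd.DataFrame({"size": ["S", "M", "L", "M"], "price": [10, 12, 15, 11]})

# Possible test: list the columns that are not numeric yet.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

# Replace "size" with one 0/1 column per distinct value.
encoded = pd.get_dummies(df, columns=["size"], dtype=int)
```

For linear models it is common to pass `drop_first=True` so that the dummy columns are not perfectly collinear; whether that is needed depends on the model.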

Numeric Problems

Null or Missing Values

Definition: Missing numeric values can cause errors when computing statistics on the column.

    *Test:*
            ?

    *Solutions:* 
            1. Impute missing values with the mean
            2. Impute missing values with the mode
            3. Impute each missing value with a model's prediction of that value
            4. Convert the column to categorical and make "missing" one of the categories
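
The sketch below shows one way solutions 1, 2, and 4 might look in pandas. The column `age`, the bin edges, and the labels are illustrative assumptions. Solution 3 needs a fitted model and is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "age" stands in for any numeric column.
df = pd.DataFrame({"age": [20.0, np.nan, 30.0, 30.0, np.nan]})

# Possible test: count the missing values.
n_missing = df["age"].isna().sum()

# Solution 1: impute with the mean of the observed values.
df["age_mean"] = df["age"].fillna(df["age"].mean())

# Solution 2: impute with the mode (most frequent observed value).
df["age_mode"] = df["age"].fillna(df["age"].mode().iloc[0])

# Solution 4: bin into categories (bin edges are arbitrary here) and
# give missing values a category of their own.
binned = pd.cut(df["age"], bins=[0, 25, 100], labels=["young", "older"])
df["age_binned"] = binned.cat.add_categories("missing").fillna("missing")
```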

Data Isn't Scaled

Definition: Some algorithms only work well when the data is scaled, for example so that every value lies between -1 and 1.

    *Test:*
            ?

    *Solutions:* 
            1. Use a scaler (for example min-max scaling or standardization) to transform the column onto a common scale
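
A hand-rolled min-max scaling to the range -1 to 1, written in plain pandas so that the arithmetic is visible. The column `income` is made up, and checking the column's minimum and maximum is just one possible test.

```python
import pandas as pd

# Hypothetical toy data; "income" stands in for any unscaled column.
df = pd.DataFrame({"income": [20_000.0, 50_000.0, 80_000.0]})

# Possible test: inspect the range of the column.
col_min, col_max = df["income"].min(), df["income"].max()

# Min-max scaling: map col_min to -1 and col_max to 1.
df["income_scaled"] = 2 * (df["income"] - col_min) / (col_max - col_min) - 1
```

In a real pipeline the minimum and maximum would normally come from the training data only and then be reused on new data. scikit-learn's `MinMaxScaler` (with its `feature_range` parameter) packages the same idea.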

Non-Normal Distribution

Definition: Many models assume that the data is normally distributed, which may not be the case.

    *Test:*
            ?

    *Solutions:* 
            1. Transform the data (for example with a log or power transform) so that it is closer to normally distributed
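
One rough way to check for a skewed column is its sample skewness, and a log transform is one possible fix for right-skewed, non-negative data. The values below are invented, and skewness alone does not prove or disprove normality.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data with a long upper tail.
s = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0, 200.0])

# Possible test: sample skewness (about 0 for a symmetric distribution).
skew_before = s.skew()

# One possible transform: log(1 + x), which compresses large values.
log_s = np.log1p(s)
skew_after = log_s.skew()
```

A histogram or Q-Q plot of the column before and after the transform is a useful visual companion to the skewness number.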

Outliers

Definition: The column contains a small number of values that are far outside the normal range of values for the column.

    *Test:*
            ?

    *Solutions:* 
            1. Cap values within a few standard deviations of the mean, and replace outliers with the capped values
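
A sketch of capping in pandas, assuming that "a few" means three standard deviations from the mean. The data and the choice `k = 3` are illustrative.

```python
import pandas as pd

# Hypothetical data: twenty ordinary values and one extreme value.
s = pd.Series([10.0, 11.0, 9.0, 12.0, 8.0] * 4 + [1000.0])

# Cap limits: k standard deviations either side of the mean (k is a choice).
k = 3
mean, std = s.mean(), s.std()
lower, upper = mean - k * std, mean + k * std

# Possible test: count the values that fall outside the limits.
is_outlier = (s < lower) | (s > upper)
n_outliers = is_outlier.sum()

# Solution: replace values beyond the limits with the limits themselves.
capped = s.clip(lower=lower, upper=upper)
```

Because the mean and standard deviation are themselves pulled by the outliers, limits based on percentiles or the interquartile range are a frequent alternative.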

Feature Selection

Redundant Features

Definition: Features are highly correlated with each other, so the additional columns contribute little new information.

    *Test:*
            Create a correlation matrix to identify correlated features

    *Solutions:* 
            1. Combine correlated features
            2. Drop some of the correlated features
            3. Use PCA to combine the features
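
The correlation-matrix test and solution 2 might be sketched as follows. The column names and the 0.9 threshold are assumptions for this example, and PCA (solution 3) is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data: height recorded twice in different units, plus an
# unrelated column.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [9.0, 7.0, 10.0, 8.0, 9.5],
})

# Test: absolute pairwise correlations between the features.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is looked at once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Solution 2: drop the later column of every pair above the threshold.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)
```

Which column of a correlated pair to keep is a judgment call; this sketch simply keeps the one that appears first.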

Irrelevant Features

Definition: Features are unrelated to the problem and are not useful.

    *Test:*
            ?

    *Solutions:* 
            1. Drop irrelevant features