Data Cleansing in Azure ML Studio (classic)

This project explains data cleaning and preprocessing with Azure Machine Learning Studio (classic).

Predicting Factors Influencing Working Mothers’ Time Investment Decision
(Using Azure Machine Learning Studio for data cleaning)

Demographic changes in society, technological disruption, and business conditions in the past decade have increased the presence of women in the job market. This shift brought a remarkable change in family structure, leaving many workers struggling to balance their responsibilities at work with their commitments in their personal lives.

Having a child is one of the responsibilities that shifts a great amount of working parents’ attention and resources away from work and causes work-family conflict. This role conflict often entails a reduction in productive time spent at work, motivation, performance, engagement, job satisfaction, and employees’ overall health and happiness.

To counter the adverse outcomes attached to work-family conflict, organizations and HR professionals often provide their employees with a package of Work-Family Supplies (WFS). WFS typically consist of policies and resources such as flexible work hours, paid parental leave, childcare subsidies, on-site childcare, emergency dependent care services, and emergency leaves of absence.

However, an organization’s provision of WFS does not necessarily mean that employees make use of it. A review of the work-family literature indicates that despite the availability of WFS, the majority of working mothers tend to minimize their use of these resources and take shorter time off from work.

Prior research suggests that individuals’ decisions regarding family-leave duration and the use of WFS may be influenced by financial status, family support, the availability of WFS, a supportive organizational climate, the number of dependents, and the mother’s or child’s general health.

The results of this study show that household income has a significant influence on respondents’ decisions about the extent of family leave and the use of WFS. In addition, there is a significant relationship between leave duration and individuals’ turnover intention. That is, individuals with financial needs decide to take shorter maternity leave, and those with shorter leave durations tend to quit their jobs and eventually stay home with their newborns. Surprisingly, the number of dependents, the existence of family support, and the extent of its availability to provide help and care did not affect mothers’ time investment decisions.

Since this survey was designed for inferential statistics, the results of the probability estimates were not significant. However, the inferred results, provided by the U.S. Department of Labor, can be a great source of information for HR professionals when provisioning HR policies and resources for maternity and parental leave.

Read the complete article here.

About the Data

The New York City Work and Family Leave Survey (WFLS) was conducted in March 2016. The survey was made possible through funding by the U.S. Department of Labor Women’s Bureau. The participants were residents of New York City who gave birth in 2014, selected through random sampling and identified using NYC birth certificates.
The survey was conducted through telephone calls, and 1,063 mothers completed the interview in English or Spanish, representing a population of 108,000 of the approximately 120,000 women who gave birth in 2014. Respondents under 18 years of age and those previously interviewed for another New York City survey were excluded from the sample. The survey aimed to shed light on the availability and accessibility of paid family leave to working parents.
Click here to download the dataset.

Setting the Environment

First, create a Machine Learning Studio workspace in the Microsoft Azure portal. In the Azure portal, click on Create a resource (as shown in the left-side image below). You will be redirected to a new page (shown below in the image to the right). In the search box, type Machine Learning Studio, click on Machine Learning Studio (classic) Workspace in the results list, and hit the Create button.

In the properties pane, set the workspace name, choose your subscription, and select an existing resource group or create a new one. For the location, choose a data center close to your location. Set a unique name for your storage account to create a new one. Set the workspace pricing tier to Standard, tick the box to create a new web service plan, and set a unique name for it. For the web service pricing tier, select DEVTEST Standard and hit the Create button.

Once the workspace is deployed, go to the new resource and click on Launch Machine Learning Studio (classic) to be redirected to your workspace.

To upload the WFLS dataset to the workspace, click on Datasets in the left pane and hit the plus sign to add a new dataset.

In the new pop-up window, select the data file from your computer, enter a name for the dataset (optional), select the type of your dataset, and indicate whether your data has a header. Click OK to upload the data.

The next step is to create an experiment. Click on Experiments in the left-side pane and create a new experiment.

Preprocessing

To locate the dataset, go to Saved Datasets ➛ My Datasets ➛ WFLS.csv in the left-side pane and drag it to the design pane.

To start the data exploration and preprocessing, from the Statistical Functions category, add the Summarize Data module to the experiment.

The Summarize Data module gives some information about the data, such as the number of rows and columns, the number of missing values in each column, and the count of unique values in each feature column. The module also generates a basic descriptive statistics report for each column in the dataset (e.g. standard deviation, min/max values, quartiles, mean, median, mode, range, variance, skewness, and kurtosis).
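For comparison, here is a rough local pandas sketch of the same summary information (the file name is illustrative; the module’s exact report differs):

import pandas as pd

# Load the survey data (file name is illustrative)
df = pd.read_csv('New_York_City_WFLS.csv')

print(df.shape)          # number of rows and columns
print(df.isna().sum())   # missing values per column
print(df.nunique())      # count of unique values per column

# Basic descriptive statistics: mean, std, min/max, quartiles, ...
print(df.describe().T)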

Connect the module to the WFLS dataset and run the experiment. After the run is completed, right-click on the module and from the drop-down list go to Results dataset ➛ Visualize.

The summary report above shows that the data has 153 columns and 1,063 rows, and summarizes the basic statistics of each column. The initial issues that need to be solved are:

  1. The labels of all features are in short codes, and without the data dictionary no one can make sense of the data.
  2. Some columns might not be useful for this research. Having too many features in the dataset can add noise to the result.
  3. All columns have numeric data types, while some features have five or fewer unique values. A small number of unique values in a column can indicate that the variable might be categorical.
  4. There are too many missing values in the dataset.

Change Column Names

Rename the columns using the data dictionary that comes with the dataset. I only rename the columns that are used in this analysis.

To rename the columns, use the Execute Python Script module to run Python code. To configure the module, add it to the design pane and connect the dataset to the first input of the module.

Scroll to the bottom of the Properties pane, and for Python Version, select the version of the Python libraries (Anaconda 4.0 / Python 3.5) to use in the script.

Pop out the script editor and use the following code to rename the columns.

import pandas as pd

def azureml_main(dataframe1 = None, dataframe2 = None):
    # dataframe1 receives the dataset connected to the module's first input
    dataframe1.rename(columns={
        'el3': 'workindustry',
        'el11': 'timeoffType', 'el15': 'returntosamejob',
        'el16': 'reasonofunemplymnt', 'es4': 'numberofdependents',
        'el1': 'workedforpay', 'el7': 'employerover50', 'el9': 'oneyeartenure',
        'el17a': 'LeavedcisionFinancial', 'el17b': 'LeavedcisionJoblos',
        'el17c': 'LeavedcisionNotoffered', 'el17d': 'LeavedcisionNoteligible',
        'el17e': 'LeavedcisionWasready', 'el17f': 'LeavedcisionNeedinsurnce',
        'cp1': 'Familysupport', 'mh2': 'diagnoseddeprssion',
        'el8': 'employersize', 'el14': 'feelabttimeoff',
        'cp5': 'fmlysprtAvailability', 'ih1': 'childHealth',
        'mh1': 'depressionaftrBirth', 'mh4': 'generalhealthafter',
        'es3': 'income', 'el12wks': 'timeoffinweeks',
        'el12mns': 'timeoffinmonth', 'el13a': 'WksofpayEarned',
        'el13b': 'WksofpayTDS', 'el13c': 'Wksofpaymaternyleave',
        'el13e': 'Wksofpayothersource', 'el13d': 'Wksofpayunpaid'},
        inplace=True)
    # Azure ML expects the module to return the dataframe(s) as a tuple
    return dataframe1,

Run the Execute Python Script module to execute the script.

Now, create a subset of the data and exclude the other columns from the dataset. To do this, from the left-side pane, go to the Data Transformation category ➛ Manipulation ➛ Select Columns in Dataset. Drop the module into the design pane.

Connect the result dataset output of the Execute Python Script module to the Select Columns in Dataset module. From the properties pane, launch the column selector to choose the columns that were renamed earlier.

Run the module. After the run is completed, right-click on the module and from the drop-down list go to Results dataset ➛ Visualize.

The visualization above shows the new column names; the number of columns has changed from 153 to 28.

The next step is cleaning the data of missing values and outliers.

Handling Missing Values

The high number of missing values in the WFLS dataset can produce misleading results. In this step, I use different methods for cleaning missing values to maintain the dimensionality and accuracy of the data.

I made a new data summary from the selected columns to carefully analyze the variables and find the best way to treat the missing values in each column.

The summary above shows that:

1. Almost all columns have a maximum value of 77 or 99. According to the data dictionary, 77 is the label for “Don’t Know” (DK) answers and 99 is the label for “Refused”. These values can be treated as missing and can also help in handling missing values in other columns.
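As a local pandas sketch of this idea (the file name is a hypothetical intermediate export), one way to treat the 77/99 codes as missing is to recode them as NaN:

import numpy as np
import pandas as pd

# Load the selected columns (hypothetical intermediate export)
df = pd.read_csv('WFLS_selected.csv')

# Recode "Don't Know" (77) and "Refused" (99) as missing values
df = df.replace([77, 99], np.nan)
print(df.isna().sum())   # missing-value counts now include the recoded answers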

Replace Missing Values with Zero

2. All five weeks-of-pay columns hold answers to the question: How many weeks of pay did you receive from maternity leave, earned weeks, temporary disability insurance, unpaid leave, and other sources? All of these columns have the DK and Refused options. Since DK and Refused answers have their own codes, a missing value means the respondent did not receive any weeks of pay from that source. These missing values can therefore be replaced with zero.

To fix this, go to the Data Transformation category ➛ Manipulation ➛ Clean Missing Data. Drag the module to the design pane and connect it to the Select Columns in Dataset module.

From the properties pane, launch the column selector and select all five columns with the prefix Wksofpay. For the cleaning mode, choose Custom substitution value, set the replacement value to 0, and run the module.
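A minimal pandas equivalent of this Clean Missing Data step might look like this (the file name is a hypothetical intermediate export):

import pandas as pd

df = pd.read_csv('WFLS_selected.csv')  # hypothetical intermediate export

# Replace missing values in the five weeks-of-pay columns with 0,
# mirroring Clean Missing Data with a custom substitution value of 0
wks_cols = [c for c in df.columns if c.startswith('Wksofpay')]
df[wks_cols] = df[wks_cols].fillna(0)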

Replace Missing Values with Zero

3. The family support availability column has 141 missing values. Reviewing the data shows that values are missing for respondents who did not have any family member to provide support and care for the child, which is equivalent to zero availability. Thus, I replaced the missing values with zero, exactly as in the previous section.

Missing Values in Categorical Data

4. All six columns related to the leave decision are binary variables with four unique values: 1: Yes, 2: No, 77: DK, and 99: Refused. The survey question for these six columns is: Did any of these situations affect your decision about taking leave from work after your child was born? I could not financially afford to take leave or as much leave as I would have liked; I was afraid I’d lose my job; my job does not have paid leave; I had not built up enough leave time to take any or more time off; I was ready to go back to work; I needed health insurance that is provided by my employer. There are two ways to handle these columns:

  • a: One-hot encoding these columns is the best way to get rid of the missing values and the 77/99 values.
  • b: Replacing values: replace the missing, 77, and 99 values with 2, since 2 is the label for No answers in the survey and the most common value in these categorical columns. To do this, follow the steps in section 2, but this time set the replacement value to 2. I will explain two more methods for replacing the 77/99 values later in this project.

Here I show both a and b, but for the visualizations I will work on the data frame produced by approach b.

a: One-Hot Encoding

One-hot encoding only works for categorical values, and the data type of all six leave-decision columns is numeric. Therefore, first we have to convert them to categories: go to the Data Transformation category ➛ Manipulation ➛ Edit Metadata. Drag the module to the design pane and connect it to the cleaned dataset output of the last module. From the properties pane, launch the column selector and select all six columns with the prefix Leavedcision. For the configuration, in the Categorical option, select Make categorical and leave the rest as default. Run the module to make the changes.

Now that all six columns are categorical, go to the Data Transformation category ➛ Manipulation ➛ Convert to Indicator Values. Drag the module to the design pane and connect it to the last module. From the properties pane, launch the column selector and select all six columns with the prefix Leavedcision. Tick the Overwrite categorical columns box and run the module.
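Outside the Studio, the same Edit Metadata plus Convert to Indicator Values combination can be sketched in pandas (the file name is a hypothetical intermediate export):

import pandas as pd

df = pd.read_csv('WFLS_selected.csv')  # hypothetical intermediate export

# Convert the six leave-decision columns to categorical, then one-hot
# encode them; passing columns= makes get_dummies replace the originals,
# mirroring the "Overwrite categorical columns" option
leave_cols = [c for c in df.columns if c.startswith('Leavedcision')]
df[leave_cols] = df[leave_cols].astype('category')
df = pd.get_dummies(df, columns=leave_cols)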

Delete Entire Rows with Missing Data

5. The employer size, employer with over 50 employees, one-year tenure, time-off type, and return to the same job columns all have 269 missing values. It is possible that those 269 respondents did not work prior to childbirth. Let’s create some visualizations with a Python script to check. Use the following code to build a bar chart.

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset as a pandas dataframe
data = pd.read_csv('../WFLS Analysis.csv')

# Build a figure with two subplots
f, ax = plt.subplots(1, 2, figsize=(16, 8))

# Left plot: count of respondents who worked for pay before birth
s = data['workedforpay'].value_counts()
d = s.plot.bar(color=['#4B8BBE', '#BBD500', '#FF8A00'], ax=ax[0])
for i, v in s.reset_index().iterrows():
    # Annotate each bar with its count
    d.text(i, v.workedforpay + 0.2, v.workedforpay, color='#820C02')
ax[0].set_title('Number of Respondents Employed before Birth')
ax[0].set_ylabel('Count')

# Right plot: one-year tenure, split by whether respondents returned to the same job
b = sns.countplot('oneyeartenure', hue='returntosamejob', data=data, ax=ax[1])
ax[1].set_title('Number of Respondents with 1 Year Incumbency')
plt.legend(loc='upper left', fontsize=14, frameon=False, bbox_to_anchor=(0.8, 1))
plt.show()

The plot on the right shows the number of individuals with (1.0) and without (2.0) one year of incumbency. The classes in each category show whether they returned to the same job after childbirth. The tallest bar, in blue, indicates that the majority of respondents returned to the same employer.

The plot on the left shows the count of respondents employed before childbirth in blue and those unemployed in green. The number of individuals without a job is 268, plus 1 DK response, which matches the 269 missing values.

With this information we can confidently remove all 269 rows containing respondents who did not work for pay before giving birth.

Go to the Data Transformation category ➛ Manipulation ➛ Clean Missing Data. Drag the module to the design pane and connect it to the last module. From the properties pane, launch the column selector and select any one of the columns employerover50, employersize, oneyeartenure, timeoffType, or returntosamejob. For the cleaning mode, choose Remove entire row and run the module.
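A pandas sketch of the same row removal (any one of the five columns works, since they share the same 269 missing rows; the file name is a hypothetical intermediate export):

import pandas as pd

df = pd.read_csv('WFLS_selected.csv')  # hypothetical intermediate export

# Drop the rows where employersize is missing, i.e. respondents who
# did not work for pay before giving birth
df = df.dropna(subset=['employersize'])
print(len(df))  # should be 269 rows fewer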

To make sure that the data is now clean and free of missing values, create a new data summary and check the results.

The last step of data cleaning is converting the data type of the income column from numeric to categorical, as we did in step 4.
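In pandas terms, this conversion is a one-liner (shown as a sketch; the file name is a hypothetical intermediate export):

import pandas as pd

df = pd.read_csv('WFLS_cleaned.csv')  # hypothetical intermediate export

# Convert the income column from a numeric to a categorical type
df['income'] = df['income'].astype('category')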

Feature Engineering

Only one variable in the dataset needs feature engineering. There are two columns for the length of respondents’ time off: one records the data in weeks and the other in months. We can use math functions to unify both values and combine them into one column.

To apply the math operations, first replace the missing values in both columns with zero.

Go to Statistical Functions ➛ Apply Math Operation. Drag the module to the design pane and connect it to the last Clean Missing Data module. From the properties pane, launch the column selector and select the feature ‘timeoffinweeks’. For the configuration, set the category to Operations, the basic operation to Multiply, the operation argument type to Constant, the constant operation argument to 7, and the output mode to Inplace.

Add another Apply Math Operation module to the pane. This time, in the column selector, choose the feature ‘timeoffinmonth’, set the constant operation argument to 30, and configure the rest as before. Run both modules.

Now that the values of both columns are in days, create a new feature with the sum of the values in both columns. Add a third Apply Math Operation module to the pane. For the configuration, set the category to Operations, the basic operation to Add, and the operation argument type to ColumnSet. For the operation argument, launch the column selector and select ‘timeoffinweeks’. For the column set, launch the column selector and select ‘timeoffinmonth’. Set the output mode to Append. Run the module.
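Taken together, the three Apply Math Operation modules amount to the following pandas sketch (the combined column name ‘timeoffindays’ is hypothetical; the module’s Append mode generates its own name, and the file name is a hypothetical intermediate export):

import pandas as pd

df = pd.read_csv('WFLS_cleaned.csv')  # hypothetical intermediate export

# Convert both time-off columns to days, in place
# (their missing values were already replaced with zero)
df['timeoffinweeks'] = df['timeoffinweeks'] * 7
df['timeoffinmonth'] = df['timeoffinmonth'] * 30

# Append the total time off in days as a new column (hypothetical name)
df['timeoffindays'] = df['timeoffinweeks'] + df['timeoffinmonth']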

The one-hot encoding and the math operations added new columns to the dataset while separating the classes in the categorical features. We can remove the extra columns because they are not necessary for this analysis.

From the left-side pane, go to the Data Transformation category ➛ Manipulation ➛ Select Columns in Dataset. Drop the module into the design pane. Launch the column selector and exclude the columns for the missing values and the 77/99 values, as well as both ‘timeoffinweeks’ and ‘timeoffinmonth’. Run the module.

b: Replace 77 and 99 Values

For the categorical values, drag and drop the Execute Python Script module to the design pane and use the following code to replace the 77 and 99 values with proper replacement values. Note that the replacement value differs for some columns according to the data contained in each.

import pandas as pd

def azureml_main(dataframe1 = None, dataframe2 = None):
    # 2 is the label for "No" answers, the mode of the leave-decision columns
    dataframe1['LeavedcisionWasready'].replace([77, 99], 2, inplace=True)
    dataframe1['LeavedcisionNeedinsurnce'].replace([77, 99], 2, inplace=True)
    dataframe1['LeavedcisionFinancial'].replace([77, 99], 2, inplace=True)
    dataframe1['LeavedcisionJoblos'].replace([77, 99], 2, inplace=True)
    dataframe1['LeavedcisionNotoffered'].replace([77, 99], 2, inplace=True)
    dataframe1['LeavedcisionNoteligible'].replace([77, 99], 2, inplace=True)
    dataframe1['income'].replace([77, 99], 12, inplace=True)
    # 1 is the label for "Yes"
    dataframe1['employerover50'].replace([77, 99], 1, inplace=True)
    dataframe1['oneyeartenure'].replace([77, 99], 1, inplace=True)
    # replace DK/Refused with 1 in the remaining categorical columns
    dataframe1['timeoffType'].replace([77, 99], 1, inplace=True)
    dataframe1['Familysupport'].replace([77, 99], 1, inplace=True)
    dataframe1['returntosamejob'].replace([77, 99], 1, inplace=True)
    dataframe1['fmlysprtAvailability'].replace([77, 99], 1, inplace=True)
    return dataframe1,

Create a new summary to check the results.

Scale and Normalize

The report above shows that there are no missing values left in the dataset. The only remaining problem is the 77 and 99 values. I used two different methods for dealing with these values:

  1. Numeric columns: use the Clip Values module to detect outliers and replace their values.
  2. Categorical columns: use the Execute Python Script module to replace the values.

For the numeric columns (all five weeks-of-pay columns and the number of dependents), use Clip Values and replace the outliers with the mean of the values.

Go to the Data Transformation category ➛ Scale and Reduce ➛ Clip Values. Drop the module into the design pane. Launch the column selector and include all six columns. For the configuration, adjust the following settings in the properties pane:

  • Set of thresholds ➛ ClipPeaks
  • Upper threshold ➛ Constant
  • Constant value for upper threshold ➛ 52
  • Upper substitute value ➛ Mean
  • Tick the box for the overwrite flag

After setting the configuration, run the module.
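As a rough pandas equivalent (assuming the module substitutes the mean for values above the threshold; its exact semantics may differ slightly, and the file name is a hypothetical intermediate export):

import pandas as pd

df = pd.read_csv('WFLS_cleaned.csv')  # hypothetical intermediate export

# Clip peaks above the 52-week threshold and substitute the mean,
# computed here from the non-outlier values
clip_cols = ['WksofpayEarned', 'WksofpayTDS', 'Wksofpaymaternyleave',
             'Wksofpayothersource', 'Wksofpayunpaid', 'numberofdependents']
for col in clip_cols:
    peaks = df[col] > 52
    df.loc[peaks, col] = df.loc[~peaks, col].mean()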

The final summary of the data shows skewness in the dataset.

This skewness can be the result of features with different ranges. Normalization can help scale the data and solve this problem by changing the values to a common scale without losing information.

To normalize the data, from the left-side pane go to Data Transformation ➛ Scale and Reduce ➛ Normalize Data. Drag and drop the module into the design pane. In the properties pane, for the transformation method select Logistic, leave the columns to transform at the default (which includes all numeric columns), and run the module. Create a new summary to compare the results.
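A sketch of the idea, assuming the standard logistic function, which squashes every numeric value x into (0, 1) via 1 / (1 + e^(-x)) (the file name is a hypothetical intermediate export):

import numpy as np
import pandas as pd

df = pd.read_csv('WFLS_cleaned.csv')  # hypothetical intermediate export

# Apply the logistic function to all numeric columns
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = 1.0 / (1.0 + np.exp(-df[num_cols]))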

As the result indicates, the skewness has slightly improved.

Save the Data

Now, save the cleaned data as a backup and for use in modeling and analysis. To save the data, drag the Convert to CSV module from the Data Format Conversions category to the design pane and run the module. To download the CSV file, right-click the module ➛ Results dataset ➛ Download.
