Data Analysis III - Validating Data

A favorite old saying of mine about data quality is "garbage in, garbage out": your analysis is only ever as good as your dataset.
Pandas is a popular Python library for data science and analysis. Used in conjunction with other data science libraries like SciPy, NumPy, and Matplotlib, it lets a modeler build end-to-end analytic workflows to solve business problems.
My objective here is to work through some examples of basic data validation.


Goal: how to get a clean and valid dataset
The sample file contains sales records; the original file has 1M rows, while the sample used here has 100.
I plan to demonstrate how to validate data using a "dirty" dataset. Specifically, how to:

- Detect/eliminate outlier data
- Deal with missing data
- Add default values
- Detect/eliminate duplicate data
- Remove incomplete data/rows
- Deal with error-prone columns
- Normalize data types
- Make bulk corrections to string/number values where appropriate
- Change casing
- Rename columns
- Save clean data to CSV or Excel file


Detect/eliminate outlier data

In [2]:
## Import pandas and check out the basic structure of the data
import pandas as pd

df = pd.read_csv('data/analysis3/SalesRecords.csv')
df.head(4)
df.shape

## Look at some basic stats for the 'OrderPriority' column
df.OrderPriority.describe()

## Select rows with total profit greater than $1.6M
df[df['TotalProfit'] > 1600000]
Out[2]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa NaN NaN M 7/27/2012 7/28/2012 1593 9.33 6.92 14862.69 11023.56 3839.13
1 940995585 Australia and Oceania NaN Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
2 940995585 Australia and Oceania Papua New Guinea Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
3 174590194 Europe Slovakia Beverages Offline L 10/26/2016 12/4/2016 3973 47.45 31.79 188518.85 126301.67 62217.18
Out[2]:
(99, 14)
Out[2]:
count     99
unique     4
top        M
freq      35
Name: OrderPriority, dtype: object
Out[2]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
10 761723172 Middle East and North Africa Algeria Cosmetics Online M 2/18/2011 2/24/2011 9669 437.20 263.33 4227286.80 2546137.77 1681149.03
28 835696351 Europe Estonia Household Offline H 9/1/2011 10/21/2011 9976 668.27 502.54 6666661.52 5013339.04 1653322.48
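
The $1.6M threshold above is hand-picked. A more general way to flag outliers, sketched below, uses the usual interquartile-range (IQR) convention: compute the quartiles of TotalProfit and treat anything beyond 1.5 IQRs outside them as an outlier. The 1.5 multiplier is a convention, not something this dataset dictates.

## Flag outliers with the IQR rule (assumes df is the dataframe loaded above)
q1 = df['TotalProfit'].quantile(0.25)
q3 = df['TotalProfit'].quantile(0.75)
iqr = q3 - q1

## Rows outside [q1 - 1.5*iqr, q3 + 1.5*iqr] are flagged as outliers
df[(df['TotalProfit'] < q1 - 1.5 * iqr) | (df['TotalProfit'] > q3 + 1.5 * iqr)]

## To eliminate rather than inspect them, keep only the in-range rows
df_no_outliers = df[df['TotalProfit'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]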


Normalize data types

- When reading data from a CSV, some numbers will come in as strings instead of numeric values, or vice versa
- We can tell pandas explicitly to read a column as a float or a string
In [3]:
## Tell pandas that the 'TotalProfit' column should be a float (a real number written with a decimal point)
df = pd.read_csv('data/analysis3/SalesRecords.csv', dtype={'TotalProfit': float})
df.TotalProfit.describe()

## Tell pandas that the 'OrderID' column should be a string, not a number
df = pd.read_csv('data/analysis3/SalesRecords.csv', dtype={'OrderID': str})
df.OrderID.describe()
Out[3]:
count    9.900000e+01
mean     3.992225e+05
std      4.345282e+05
min      3.188430e+03
25%      6.653090e+04
50%      2.360073e+05
75%      6.386010e+05
max      1.681149e+06
Name: TotalProfit, dtype: float64
Out[3]:
count            99
unique           91
top       198927056
freq              3
Name: OrderID, dtype: object
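
The goals list also mentions error-prone columns. Forcing a dtype at read time raises an error if a column mixes numbers with stray strings; a gentler option, sketched below, is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN so they can be handled like any other missing data. UnitPrice is used purely for illustration; in this sample it is already numeric.

## Coerce a possibly messy column to numeric; unparseable entries become NaN
pd.to_numeric(df['UnitPrice'], errors='coerce').isna().sum()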


Add in a default value for the missing data

In [4]:
## Detect missing data in the Country column and replace NaN with the default 'Missing Country'
df.Country = df.Country.fillna('Missing Country')
df.head(4)
Out[4]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa NaN NaN M 7/27/2012 7/28/2012 1593 9.33 6.92 14862.69 11023.56 3839.13
1 940995585 Australia and Oceania Missing Country Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
2 940995585 Australia and Oceania Papua New Guinea Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
3 174590194 Europe Slovakia Beverages Offline L 10/26/2016 12/4/2016 3973 47.45 31.79 188518.85 126301.67 62217.18
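
The same idea works for numeric columns, where a statistic often makes a better default than a sentinel string. Below is a minimal sketch, shown without assigning back so the dataframe used in later cells is unaffected: fill the missing UnitPrice values with the column median (preferring the median over the mean is a judgment call, not a requirement).

## Fill missing UnitPrice values with the column median (not assigned back here)
df['UnitPrice'].fillna(df['UnitPrice'].median()).head(3)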


Delete the rows that have missing data

In [5]:
## Drop all rows with any NA values
df.dropna().head(2)

## Drop only the rows where every value is NA
## No row here is entirely empty, so nothing is dropped
df.dropna(how='all').head(2)

## Require a minimum number of non-null values (here 5) for a row to be kept
df.dropna(thresh=5).head(2)

## The original dataframe is unchanged; dropna returns a copy by default
df.shape

## Drop rows where SalesChannel is NaN
df.dropna(subset=['SalesChannel']).head(2)
Out[5]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
3 174590194 Europe Slovakia Beverages Offline L 10/26/2016 12/4/2016 3973 47.45 31.79 188518.85 126301.67 62217.18
5 425793445 Sub-Saharan Africa Missing Country Beverages Online M 1/18/2013 2/16/2013 597 47.45 31.79 28327.65 18978.63 9349.02
Out[5]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa NaN NaN M 7/27/2012 7/28/2012 1593 9.33 6.92 14862.69 11023.56 3839.13
1 940995585 Australia and Oceania Missing Country Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
Out[5]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa NaN NaN M 7/27/2012 7/28/2012 1593 9.33 6.92 14862.69 11023.56 3839.13
1 940995585 Australia and Oceania Missing Country Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
Out[5]:
(99, 14)
Out[5]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
3 174590194 Europe Slovakia Beverages Offline L 10/26/2016 12/4/2016 3973 47.45 31.79 188518.85 126301.67 62217.18
4 174590194 Asia Sri Lanka NaN Online L 11/7/2011 12/18/2011 1379 9.33 6.92 12866.07 9542.68 3323.39
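
Because every dropna call above returns a new dataframe, df itself is untouched, which is why the shape is still (99, 14). To keep a cleanup, assign the result to a name; a new name is used in the sketch below so the cells that follow still see the original rows.

## Persist a drop by assigning the result; df itself stays intact
df_complete = df.dropna(subset=['SalesChannel'])
df_complete.shape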


Delete the columns that have a high incidence of missing data

In [6]:
## Drop the columns where every value is NA
df.dropna(axis=1, how='all').head(2)

## Drop all columns with any NA values (ItemType, SalesChannel, UnitPrice, and TotalRevenue are dropped)
df.dropna(axis=1, how='any').head(2)
Out[6]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa NaN NaN M 7/27/2012 7/28/2012 1593 9.33 6.92 14862.69 11023.56 3839.13
1 940995585 Australia and Oceania Missing Country Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
Out[6]:
OrderID Region Country OrderPriority OrderDate Ship Date UnitsSold UnitCost TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa M 7/27/2012 7/28/2012 1593 6.92 11023.56 3839.13
1 940995585 Australia and Oceania Missing Country M 5/15/2015 6/4/2015 360 364.69 131288.40 20592.00
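
Neither how='all' nor how='any' quite captures the heading's "high incidence of missing data". The thresh parameter is a better fit for that: the sketch below keeps only the columns that are non-null in at least half of the rows; the 50% cutoff is an arbitrary choice for illustration.

## Keep columns with at least 50% non-null values; the cutoff is arbitrary
df.dropna(axis=1, thresh=len(df) // 2).head(2)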


Detect/eliminate duplicate data

- Identify which observations are duplicates
- List the duplicated data and save the list to a CSV file
- Delete the duplicated data
  - drop_duplicates returns a DataFrame with duplicate rows removed, optionally only considering certain columns
- Save the clean data to a CSV file
In [7]:
## Identify which observations are duplicates; True means the row is a duplicate
df.duplicated().head(10)
Out[7]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8     True
9    False
dtype: bool
In [8]:
## List the duplicated rows
df_dup = df[df.duplicated()]
print("There are %s duplicate rows" % (df_dup.shape[0]))
df_dup.head()

## Save the list of duplicated data to CSV
df_dup.to_csv('data/analysis3/SalesRecordsDup.csv')
There are 5 duplicate rows
Out[8]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
8 601245963 Sub-Saharan Africa Ghana Office Supplies Online L 3/23/2017 4/15/2017 896 651.21 524.96 583484.16 470364.16 113120.00
17 807785928 Sub-Saharan Africa Ethiopia Cosmetics Online M 7/7/2011 7/25/2011 662 437.20 263.33 289426.40 174324.46 115101.94
23 135178029 Europe United Kingdom Cosmetics Online L 5/1/2015 5/16/2015 1038 437.20 263.33 453813.60 273336.54 180477.06
25 198927056 Asia China Office Supplies Online M 2/10/2016 3/29/2016 5791 651.21 524.96 3771157.11 3040043.36 731113.75
26 198927056 Asia China Office Supplies Online M 2/10/2016 3/29/2016 5791 651.21 524.96 3771157.11 3040043.36 731113.75
In [9]:
## Delete duplicated data, keeping the first occurrence of each OrderID
df_clean = df.drop_duplicates(subset=['OrderID'], keep='first')

## Save clean data to a csv file
df_clean.to_csv('data/analysis3/SalesRecordsClean.csv')
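
One related trick: duplicated() with the default keep='first' flags only the second and later copies. Passing keep=False, as sketched below, flags every row that shares an OrderID, which is useful when you want to inspect both halves of each duplicate pair before deciding which one to keep.

## Show every row involved in an OrderID collision, not just the later copies
df[df.duplicated(subset=['OrderID'], keep=False)].sort_values('OrderID').head()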


Change column values to uppercase and remove whitespace

In [10]:
## Change all Region values to uppercase
df['Region'].str.upper().tail(3)

## Remove leading and trailing whitespace from Country (str.strip does both)
df['Country'].str.strip().tail(3)
Out[10]:
96       SUB-SAHARAN AFRICA
97    AUSTRALIA AND OCEANIA
98       SUB-SAHARAN AFRICA
Name: Region, dtype: object
Out[10]:
96     Malawi
97    Vanuatu
98       Mali
Name: Country, dtype: object
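
As with dropna, the string methods above return new Series without touching df; assign back to persist them. The same pattern covers the bulk string corrections from the goals list, via replace with a mapping. The misspelling corrected below is hypothetical, purely to show the shape of the call.

## Persist the cleanups by assigning back to the columns
df['Region'] = df['Region'].str.upper()
df['Country'] = df['Country'].str.strip()

## Bulk-correct values with a mapping (this misspelling is hypothetical)
df['Country'] = df['Country'].replace({'United Kingdm': 'United Kingdom'})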


Rename columns

In [11]:
df_rename = df.rename(columns = {'Region':'Continent', 'TotalProfit':'Net Income'})


Save the file

In [12]:
## Save the dataframe with the renamed columns to a CSV file
df_rename.to_csv('data/analysis3/SalesRecordsRename.csv')
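
The goals list also promised an Excel option. pandas writes Excel files via to_excel, as sketched below; this assumes an Excel engine such as openpyxl is installed, and index=False keeps the row index out of the file.

## Save to an Excel file instead of CSV (requires an engine such as openpyxl)
df_rename.to_excel('data/analysis3/SalesRecordsRename.xlsx', index=False)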