Data Analysis III - Validating Data

A favorite old saying of mine about data quality is "garbage in, garbage out": your analysis is only ever as good as your dataset.
Pandas is a popular Python library for data science and analysis. Used in conjunction with other data science libraries like SciPy, NumPy, and Matplotlib, it lets a modeler build end-to-end analytic workflows to solve business problems.
My objective here is to work through some examples of basic data validation.


Goal: how to get a clean and valid dataset
The sample file contains sales records; the original file has 1M rows, while the sample used here has 100.
I plan to demonstrate how to validate data using a "dirty" dataset. Specifically, how to:

- Detect/eliminate outlier data
- Deal with missing data
- Add default values
- Detect/eliminate duplicate data
- Remove incomplete data/rows
- Deal with error-prone columns
- Normalize data types
- Make bulk corrections to string/number values where appropriate
- Change casing
- Rename columns
- Save clean data to CSV or Excel file


Detect/eliminate outlier data

In [2]:
## Import pandas and check out the basic structure of the data
import pandas as pd

df = pd.read_csv('data/analysis3/SalesRecords.csv')
df.head(4)
df.shape

## Look at some basic stats for the 'OrderPriority' column
df.OrderPriority.describe()

## Select rows with total profit greater than $1.6M
df[df['TotalProfit'] > 1600000]
Out[2]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa NaN NaN M 7/27/2012 7/28/2012 1593 9.33 6.92 14862.69 11023.56 3839.13
1 940995585 Australia and Oceania NaN Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
2 940995585 Australia and Oceania Papua New Guinea Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
3 174590194 Europe Slovakia Beverages Offline L 10/26/2016 12/4/2016 3973 47.45 31.79 188518.85 126301.67 62217.18
Out[2]:
(99, 14)
Out[2]:
count     99
unique     4
top        M
freq      35
Name: OrderPriority, dtype: object
Out[2]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
10 761723172 Middle East and North Africa Algeria Cosmetics Online M 2/18/2011 2/24/2011 9669 437.20 263.33 4227286.80 2546137.77 1681149.03
28 835696351 Europe Estonia Household Offline H 9/1/2011 10/21/2011 9976 668.27 502.54 6666661.52 5013339.04 1653322.48
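
The $1.6M threshold above is hand-picked. A more general way to flag outliers, sketched below, uses the usual interquartile-range (IQR) convention: compute the quartiles of TotalProfit and treat anything beyond 1.5 IQRs outside them as an outlier. The 1.5 multiplier is a convention, not something this dataset dictates.

## Flag outliers with the IQR rule (assumes df is the dataframe loaded above)
q1 = df['TotalProfit'].quantile(0.25)
q3 = df['TotalProfit'].quantile(0.75)
iqr = q3 - q1

## Rows outside [q1 - 1.5*iqr, q3 + 1.5*iqr] are flagged as outliers
df[(df['TotalProfit'] < q1 - 1.5 * iqr) | (df['TotalProfit'] > q3 + 1.5 * iqr)]

## To eliminate rather than inspect them, keep only the in-range rows
df_no_outliers = df[df['TotalProfit'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]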


Normalize data types

- When reading data from a CSV, some numbers will come in as strings instead of numeric values, or vice versa
- We can tell pandas explicitly to read a column as a float or a string
In [3]:
## Tell pandas that the 'TotalProfit' column should be a float (a real number written with a decimal point)
df = pd.read_csv('data/analysis3/SalesRecords.csv', dtype={'TotalProfit': float})
df.TotalProfit.describe()

## Tell pandas that the 'OrderID' column should be a string, not a number
df = pd.read_csv('data/analysis3/SalesRecords.csv', dtype={'OrderID': str})
df.OrderID.describe()
Out[3]:
count    9.900000e+01
mean     3.992225e+05
std      4.345282e+05
min      3.188430e+03
25%      6.653090e+04
50%      2.360073e+05
75%      6.386010e+05
max      1.681149e+06
Name: TotalProfit, dtype: float64
Out[3]:
count            99
unique           91
top       198927056
freq              3
Name: OrderID, dtype: object
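
The goals list also mentions error-prone columns. Forcing a dtype at read time raises an error if a column mixes numbers with stray strings; a gentler option, sketched below, is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN so they can be handled like any other missing data. UnitPrice is used purely for illustration; in this sample it is already numeric.

## Coerce a possibly messy column to numeric; unparseable entries become NaN
pd.to_numeric(df['UnitPrice'], errors='coerce').isna().sum()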


Add in a default value for the missing data

In [4]:
## Detect missing data in the Country column and replace NaN with the default 'Missing Country'
df.Country = df.Country.fillna('Missing Country')
df.head(4)
Out[4]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa NaN NaN M 7/27/2012 7/28/2012 1593 9.33 6.92 14862.69 11023.56 3839.13
1 940995585 Australia and Oceania Missing Country Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
2 940995585 Australia and Oceania Papua New Guinea Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
3 174590194 Europe Slovakia Beverages Offline L 10/26/2016 12/4/2016 3973 47.45 31.79 188518.85 126301.67 62217.18
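
The same idea works for numeric columns, where a statistic often makes a better default than a sentinel string. Below is a minimal sketch, shown without assigning back so the dataframe used in later cells is unaffected: fill the missing UnitPrice values with the column median (preferring the median over the mean is a judgment call, not a requirement).

## Fill missing UnitPrice values with the column median (not assigned back here)
df['UnitPrice'].fillna(df['UnitPrice'].median()).head(3)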


Delete the rows that have missing data

In [5]:
## Drop all rows with any NA values
df.dropna().head(2)

## Drop only the rows where every value is NA
## No row here is entirely empty, so nothing is dropped
df.dropna(how='all').head(2)

## Require a minimum number of non-null values (here 5) for a row to be kept
df.dropna(thresh=5).head(2)

## The original dataframe is unchanged; dropna returns a copy by default
df.shape

## Drop rows where SalesChannel is NaN
df.dropna(subset=['SalesChannel']).head(2)
Out[5]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
3 174590194 Europe Slovakia Beverages Offline L 10/26/2016 12/4/2016 3973 47.45 31.79 188518.85 126301.67 62217.18
5 425793445 Sub-Saharan Africa Missing Country Beverages Online M 1/18/2013 2/16/2013 597 47.45 31.79 28327.65 18978.63 9349.02
Out[5]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa NaN NaN M 7/27/2012 7/28/2012 1593 9.33 6.92 14862.69 11023.56 3839.13
1 940995585 Australia and Oceania Missing Country Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
Out[5]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa NaN NaN M 7/27/2012 7/28/2012 1593 9.33 6.92 14862.69 11023.56 3839.13
1 940995585 Australia and Oceania Missing Country Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
Out[5]:
(99, 14)
Out[5]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
3 174590194 Europe Slovakia Beverages Offline L 10/26/2016 12/4/2016 3973 47.45 31.79 188518.85 126301.67 62217.18
4 174590194 Asia Sri Lanka NaN Online L 11/7/2011 12/18/2011 1379 9.33 6.92 12866.07 9542.68 3323.39
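
Because every dropna call above returns a new dataframe, df itself is untouched, which is why the shape is still (99, 14). To keep a cleanup, assign the result to a name; a new name is used in the sketch below so the cells that follow still see the original rows.

## Persist a drop by assigning the result; df itself stays intact
df_complete = df.dropna(subset=['SalesChannel'])
df_complete.shape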


Delete the columns that have a high incidence of missing data

In [6]:
## Drop the columns where every value is NA
df.dropna(axis=1, how='all').head(2)

## Drop all columns with any NA values (ItemType, SalesChannel, UnitPrice, and TotalRevenue are dropped)
df.dropna(axis=1, how='any').head(2)
Out[6]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa NaN NaN M 7/27/2012 7/28/2012 1593 9.33 6.92 14862.69 11023.56 3839.13
1 940995585 Australia and Oceania Missing Country Meat NaN M 5/15/2015 6/4/2015 360 NaN 364.69 151880.40 131288.40 20592.00
Out[6]:
OrderID Region Country OrderPriority OrderDate Ship Date UnitsSold UnitCost TotalCost TotalProfit
0 443368995 Sub-Saharan Africa South Africa M 7/27/2012 7/28/2012 1593 6.92 11023.56 3839.13
1 940995585 Australia and Oceania Missing Country M 5/15/2015 6/4/2015 360 364.69 131288.40 20592.00
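
Neither how='all' nor how='any' quite captures the heading's "high incidence of missing data". The thresh parameter is a better fit for that: the sketch below keeps only the columns that are non-null in at least half of the rows; the 50% cutoff is an arbitrary choice for illustration.

## Keep columns with at least 50% non-null values; the cutoff is arbitrary
df.dropna(axis=1, thresh=len(df) // 2).head(2)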


Detect/eliminate duplicate data

- Identify which observations are duplicates
- List the duplicated data and save the list to a CSV file
- Delete the duplicated data
  - drop_duplicates returns a DataFrame with duplicate rows removed, optionally only considering certain columns
- Save the clean data to a CSV file
In [7]:
## Identify which observations are duplicates; True means the row is a duplicate
df.duplicated().head(10)
Out[7]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8     True
9    False
dtype: bool
In [8]:
## List the duplicated rows
df_dup = df[df.duplicated()]
print("There are %s duplicate rows" % (df_dup.shape[0]))
df_dup.head()

## Save the list of duplicated data to CSV
df_dup.to_csv('data/analysis3/SalesRecordsDup.csv')
There are 5 duplicate rows
Out[8]:
OrderID Region Country ItemType SalesChannel OrderPriority OrderDate Ship Date UnitsSold UnitPrice UnitCost TotalRevenue TotalCost TotalProfit
8 601245963 Sub-Saharan Africa Ghana Office Supplies Online L 3/23/2017 4/15/2017 896 651.21 524.96 583484.16 470364.16 113120.00
17 807785928 Sub-Saharan Africa Ethiopia Cosmetics Online M 7/7/2011 7/25/2011 662 437.20 263.33 289426.40 174324.46 115101.94
23 135178029 Europe United Kingdom Cosmetics Online L 5/1/2015 5/16/2015 1038 437.20 263.33 453813.60 273336.54 180477.06
25 198927056 Asia China Office Supplies Online M 2/10/2016 3/29/2016 5791 651.21 524.96 3771157.11 3040043.36 731113.75
26 198927056 Asia China Office Supplies Online M 2/10/2016 3/29/2016 5791 651.21 524.96 3771157.11 3040043.36 731113.75
In [9]:
## Delete duplicated data, keeping the first occurrence of each OrderID
df_clean = df.drop_duplicates(subset=['OrderID'], keep='first')

## Save clean data to a csv file
df_clean.to_csv('data/analysis3/SalesRecordsClean.csv')
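
One related trick: duplicated() with the default keep='first' flags only the second and later copies. Passing keep=False, as sketched below, flags every row that shares an OrderID, which is useful when you want to inspect both halves of each duplicate pair before deciding which one to keep.

## Show every row involved in an OrderID collision, not just the later copies
df[df.duplicated(subset=['OrderID'], keep=False)].sort_values('OrderID').head()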


Change column values to uppercase and remove whitespace

In [10]:
## Change all Region values to uppercase
df['Region'].str.upper().tail(3)

## Remove leading and trailing whitespace from Country (str.strip does both)
df['Country'].str.strip().tail(3)
Out[10]:
96       SUB-SAHARAN AFRICA
97    AUSTRALIA AND OCEANIA
98       SUB-SAHARAN AFRICA
Name: Region, dtype: object
Out[10]:
96     Malawi
97    Vanuatu
98       Mali
Name: Country, dtype: object
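
As with dropna, the string methods above return new Series without touching df; assign back to persist them. The same pattern covers the bulk string corrections from the goals list, via replace with a mapping. The misspelling corrected below is hypothetical, purely to show the shape of the call.

## Persist the cleanups by assigning back to the columns
df['Region'] = df['Region'].str.upper()
df['Country'] = df['Country'].str.strip()

## Bulk-correct values with a mapping (this misspelling is hypothetical)
df['Country'] = df['Country'].replace({'United Kingdm': 'United Kingdom'})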


Rename columns

In [11]:
df_rename = df.rename(columns = {'Region':'Continent', 'TotalProfit':'Net Income'})


Save the file

In [12]:
## Save the dataframe with the renamed columns to a CSV file
df_rename.to_csv('data/analysis3/SalesRecordsRename.csv')
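
The goals list also promised an Excel option. pandas writes Excel files via to_excel, as sketched below; this assumes an Excel engine such as openpyxl is installed, and index=False keeps the row index out of the file.

## Save to an Excel file instead of CSV (requires an engine such as openpyxl)
df_rename.to_excel('data/analysis3/SalesRecordsRename.xlsx', index=False)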