Data Analysis III - Validating Data
A favorite old saying of mine with respect to data quality is "Garbage In, Garbage Out".
Your analysis is only ever as good as your dataset.
Pandas is a popular Python library used for data science and analysis. Used in conjunction with other data science toolsets like SciPy, NumPy, and Matplotlib, a modeler can create end-to-end analytic workflows to solve business problems.
My objective here is to go through some examples for how to perform basic data validation.
Goal: how to get a clean and valid dataset.
The sample file contains sales records; the original file has 1M rows, while this sample has 100 rows.
I plan to demonstrate how to validate data using a "dirty" dataset. Specifically, how to:
- Detect/eliminate outlier data
- Deal with missing data
- Add default values
- Detect/eliminate duplicate data
- Remove incomplete data/rows
- Deal with error-prone columns
- Normalize data types
- Make bulk corrections to string/number values where appropriate
- Change casing
- Rename columns
- Save clean data to CSV or Excel file
Detect/Eliminate outlier data
In [2]:
## Check out the basic structure of the data
import pandas as pd

df = pd.read_csv('data/analysis3/SalesRecords.csv')
df.head(4)
df.shape
## Look at some basic stats for the 'OrderPriority' column
df.OrderPriority.describe()
## Select rows with total profit greater than $1.6M
df[df['TotalProfit'] > 1600000]
Out[2]:
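The filter above only flags the extreme rows; it does not remove them. One common way to actually eliminate outliers is an interquartile-range (IQR) fence. The sketch below is a generic recipe, not something derived from this dataset (the 1.5 multiplier is the usual convention):

## Keep only rows whose TotalProfit falls inside the 1.5*IQR fence
q1 = df['TotalProfit'].quantile(0.25)
q3 = df['TotalProfit'].quantile(0.75)
iqr = q3 - q1
mask = df['TotalProfit'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_no_outliers = df[mask]
df_no_outliers.shape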
Normalize data types
- When reading data from a CSV, some numbers may be read in as strings instead of numeric values, or vice versa
- We can tell pandas to read a column as float or string
In [3]:
## Tell pandas that the 'TotalProfit' column should be a float (a real number written with a decimal point)
df = pd.read_csv('data/analysis3/SalesRecords.csv', dtype={'TotalProfit': float})
df.TotalProfit.describe()
## Tell pandas that the 'OrderID' column should be a string, not a number
df = pd.read_csv('data/analysis3/SalesRecords.csv', dtype={'OrderID': str})
df.OrderID.describe()
Out[3]:
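For an error-prone column (one of the goals above), a plain dtype= cast raises a ValueError at read time if any value fails to parse. A minimal sketch of a more forgiving approach, using pd.to_numeric with errors='coerce' so unparseable values become NaN and can be handled by the missing-data steps below:

## Coerce unparseable values to NaN instead of raising a ValueError
df['TotalProfit'] = pd.to_numeric(df['TotalProfit'], errors='coerce')
## Count how many values failed to parse
df['TotalProfit'].isna().sum()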
Add a default value for missing data
In [4]:
## Add a default value for missing data
## Detect missing values in the Country column and replace NaN with 'Missing Country'
df.Country = df.Country.fillna('Missing Country')
df.head(4)
Out[4]:
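The same fillna pattern works for numeric columns, where the column median is a common default. A sketch, assuming a hypothetical UnitsSold column in the sales file:

## Fill missing numeric values with the column median (UnitsSold is an assumed column name)
df['UnitsSold'] = df['UnitsSold'].fillna(df['UnitsSold'].median())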
Delete the rows that have missing data
In [5]:
## Drop all rows with any NA values
df.dropna().head(2)
## Drop rows where all values are NA
## Since we replaced NA in Country with 'Missing Country', no row is all-NA; otherwise the 2nd row would be dropped
df.dropna(how='all').head(2)
## Require at least 5 non-null values in a row in order to keep it
df.dropna(thresh=5).head(2)
## Original DataFrame is unchanged (dropna returns a copy)
df.shape
## Drop rows where the SalesChannel column is NaN
df.dropna(subset=['SalesChannel']).head(2)
Out[5]:
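Because dropna returns a copy, none of the calls above change df (hence df.shape is unchanged). To keep a result, assign it to a name; a minimal sketch:

## Persist the result by assigning it to a new DataFrame
df_rows_kept = df.dropna(subset=['SalesChannel'])
df_rows_kept.shape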
Delete the columns that have a high incidence of missing data
In [6]:
## Drop columns where all values are NA
df.dropna(axis=1, how='all').head(2)
## Drop columns with any NA values - after the fill above, "Country" is the only column without any NA values
df.dropna(axis=1, how='any').head(2)
Out[6]:
Detect/eliminate duplicate data
- Identify which observations are duplicates
- List the duplicated data and save the list to a CSV file
- Delete duplicated data
- Return DataFrame with duplicate rows removed, optionally only considering certain columns
- Save the clean data to a CSV file
In [7]:
## Identify which observations are duplicates; True means duplicate
df.duplicated().head(10)
Out[7]:
In [8]:
## List the duplicated rows
df_dup = df[df.duplicated()]
print("There are %s duplicate rows" % (df_dup.shape[0]))
df_dup.head()
## Save the list of duplicated data to CSV
df_dup.to_csv('data/analysis3/SalesRecordsDup.csv')
Out[8]:
In [9]:
## Delete duplicated data, keeping the first occurrence of each OrderID
df_clean = df.drop_duplicates(subset=['OrderID'], keep='first')
## Save clean data to a csv file
df_clean.to_csv('data/analysis3/SalesRecordsClean.csv')
Change a column to uppercase and strip whitespace
In [10]:
## Change all Region values to uppercase
df['Region'].str.upper().tail(3)
## Strip leading/trailing whitespace from Country
df['Country'].str.strip().tail(3)
Out[10]:
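As with dropna, the .str methods above return new Series; the changes only stick if you assign them back. The same assignment pattern covers the bulk-corrections goal via replace; the 'USA' mapping below is a hypothetical example, not a value taken from the file:

## Persist the casing and whitespace fixes
df['Region'] = df['Region'].str.upper()
df['Country'] = df['Country'].str.strip()
## Bulk-correct known bad string values with a mapping (example values are hypothetical)
df['Country'] = df['Country'].replace({'USA': 'United States of America'})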
Rename columns
In [11]:
df_rename = df.rename(columns = {'Region':'Continent', 'TotalProfit':'Net Income'})
Save the file
In [12]:
## Save the file with the new column names to a CSV file
df_rename.to_csv('data/analysis3/SalesRecordsRename.csv')
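The goals above also mention saving to an Excel file; to_excel mirrors to_csv (writing .xlsx requires the openpyxl package to be installed):

## Save the renamed data to an Excel file as well (requires openpyxl)
df_rename.to_excel('data/analysis3/SalesRecordsRename.xlsx', index=False)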