Data Analysis I - Sourcing Data
Goal: Use Python to load data into a dataframe and get basic statistics.
Specifically, I plan to use Python to:
- Import an Excel dataset with 1k rows and 132 columns into a dataframe
- Get preliminary dataframe statistics: sum, mean, min/max
- Visualize the dataframe as a time series graph
The 2 sample data files I will use are:
- "2015 USA Census Data" from gov.org (data/analysis1/sample.xlsx)
- "Historical daily price data for 9 stocks from 2010-2011" (data/analysis1/Stock_px.xlsx)
First Example - 2015 USA Census Data in Excel
- Python Pandas to Read Excel Files
In [2]:
## Import pandas to read Excel, head(3) to show the first 3 rows only, shape to show how many rows/columns
df = pd.read_excel('data/analysis1/sample.xlsx')
df.head(3)
df.shape
Out[2]:
Out[2]:
Two Commands in Python Pandas to Read Excel Files
- pd = pd.read_excel("data/sample.xlsx") or
- pd = pd.ExcelFile("data/sample.xlsx")
- Both commands can read Excel file
- Use Pandas Timeit to time how long does it take to run the reports
- In general, the ExcelFile command will be faster
In [5]:
def test_excelfile():
df = pd.ExcelFile("data/analysis1/sample.xlsx") # this code is faster...use panda timeit library to compare
def test_read_excel():
df = pd.read_excel("data/analysis1/sample.xlsx")
print("pd.ExcelFile: time=")
%timeit test_excelfile()
print("pd.read_excel: time=")
%timeit test_read_excel()
How to filter data in Python instead of in Excel?
- The following example filters and count ("shape") the dataframe for zipcode 35005
- This approach can easily scale to more complicated datasets, and provides results using Python pandas dataframe
In [7]:
df_filtered=df[df["zipcode"]==35005]
df_filtered.head(3)
df_filtered.shape
Out[7]:
Out[7]:
Second Example - 9 stocks historical price from 2010-2011
- Import pandas and read the Excel file (specified the directory)
In [8]:
df_stock = pd.read_excel('data/analysis1/Stock_px.xlsx')
df_stock.head(3)
df_stock.shape # find out how many rows and columns in the file
Out[8]:
Out[8]:
Insert a new column that is the sum of 3 existing columns
In [9]:
## Insert a new column and Show the last 3 rows of data
df_stock["total"] = df_stock["AA"] + df_stock["AAPL"] + df_stock["GE"]
df_stock.tail(3)
Out[9]:
In [7]:
## Now get the sum / mean / min / max for stock AA
df_stock["AA"].sum(), df_stock["AA"].mean(),df_stock["AA"].min(),df_stock["AA"].max()
Out[7]:
Pandas Time Series Analysis: DatetimeIndex and Resample
- Change column DATE type from string to timeseries (see Pandas TimeSeries blog)
- Set column "DATE" as index
In [10]:
## Check the data type
type(df_stock.DATE[0])
## the type is a timestamp
Out[10]:
In [11]:
## Reload the dataframe and set date as index
df_stock_index = pd.read_excel('data/analysis1/Stock_px.xlsx', parse_dates=["DATE"], index_col="DATE")
df_stock_index.head(3)
Out[11]:
In [12]:
## Now we use the dataframe resample method to create a line graph with multiple series
df_stock_index.resample('D').sum().plot()
Out[12]: