Data Analysis I - Sourcing Data

Goal: Use Python to load data into a dataframe and get basic statistics.

Specifically, I plan to use Python to:

- Import an Excel dataset (roughly 100 rows and 132 columns) into a dataframe
- Get preliminary dataframe statistics: sum, mean, min/max 
- Visualize the dataframe as a time series graph

The 2 sample data files I will use are:

- "2015 USA Census Data" from gov.org (data/analysis1/sample.xlsx)
- "Historical daily price data for 9 stocks from 2010-2011" (data/analysis1/Stock_px.xlsx)    


First Example - 2015 USA Census Data in Excel
- Python Pandas to Read Excel Files

In [2]:
## Import pandas, then use head(3) to show the first 3 rows only and shape to show how many rows/columns
import pandas as pd

df = pd.read_excel('data/analysis1/sample.xlsx')
df.head(3)
df.shape
Out[2]:
STATEFIPS STATE zipcode agi_stub income1 income2 income3 income4 income5 income6 ... A10300 N85530 A85530 N85300 A85300 N11901 A11901 N11902 A11902 total
0 1 New Hampshere 0 1 836320 481570 109790 233260 455560 1356760 ... 328469 0 0 0 0 61920 48150 732670 1933120 1427680
1 1 New Yok 0 2 494830 206630 146250 129390 275920 1010990 ... 965011 0 0 0 0 73720 107304 415410 1187403 847710
2 1 Floria 0 3 261250 80720 139280 36130 155100 583910 ... 1333418 0 0 0 0 64200 139598 193030 536699 481250

3 rows × 132 columns

Out[2]:
(101, 132)
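Note: a plain notebook cell only displays the last expression, so seeing both df.head(3) and df.shape under Out[2] suggests the shell was configured to show every result. A minimal sketch of that setting, assuming IPython/Jupyter is in use:

## Show the result of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"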


Two Commands in Python Pandas to Read Excel Files

- df = pd.read_excel("data/sample.xlsx") or
- xls = pd.ExcelFile("data/sample.xlsx")
- Both commands can read an Excel file
- Use the IPython %timeit magic to measure how long each call takes
- In general, pd.ExcelFile will be faster, since it only opens the workbook without parsing a sheet into a dataframe
In [5]:
def test_excelfile():
    df = pd.ExcelFile("data/analysis1/sample.xlsx")  # only opens the workbook; compare timings with %timeit below

def test_read_excel():
    df = pd.read_excel("data/analysis1/sample.xlsx")  # also parses the sheet into a dataframe

print("pd.ExcelFile: time=")
%timeit test_excelfile()

print("pd.read_excel: time=")
%timeit test_read_excel()
pd.ExcelFile: time=
67.2 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
pd.read_excel: time=
86.5 ms ± 3.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
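The speed difference makes sense: pd.ExcelFile only opens the workbook, while pd.read_excel also parses a sheet into a dataframe. A short sketch of getting the same dataframe from an ExcelFile object (assuming the data is on the first sheet):

## Open the workbook once, then parse the first sheet into a dataframe
xls = pd.ExcelFile("data/analysis1/sample.xlsx")
df_from_xls = xls.parse(0)   # equivalent to pd.read_excel on that sheet
df_from_xls.shape            # expected to match (101, 132) from above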


How to filter data in Python instead of in Excel?

- The following example filters the dataframe for zipcode 35005 and counts the matching rows with shape
- This approach scales easily to more complicated filters, and returns the results as a pandas dataframe (see the sketch after the output below)
In [7]:
df_filtered = df[df["zipcode"] == 35005]

df_filtered.head(3)
df_filtered.shape
Out[7]:
STATEFIPS STATE zipcode agi_stub income1 income2 income3 income4 income5 income6 ... A10300 N85530 A85530 N85300 A85300 N11901 A11901 N11902 A11902 total
12 1 VERMONT 35005 1 1350 750 190 390 730 2300 ... 545 0 0 0 0 100 74 1200 3243 2290
13 1 New Hampshere 35005 2 980 370 230 350 530 1970 ... 1679 0 0 0 0 100 116 880 2501 1580
14 1 New Yok 35005 3 490 150 210 120 290 1070 ... 2373 0 0 0 0 100 226 390 1136 850

3 rows × 132 columns

Out[7]:
(6, 132)
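The same indexing pattern extends to multiple conditions. A small sketch combining two boolean masks (column names are taken from the table above; the filter values are purely illustrative):

## Combine conditions with & (and) / | (or); each condition needs its own parentheses
df_multi = df[(df["zipcode"] == 35005) & (df["agi_stub"] <= 2)]
df_multi.shape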


Second Example - Historical Daily Prices for 9 Stocks from 2010-2011
- Read the Excel file with pandas (specifying the file path)

In [8]:
df_stock = pd.read_excel('data/analysis1/Stock_px.xlsx')
df_stock.head(3)
df_stock.shape  # find out how many rows and columns are in the file
Out[8]:
DATE AA AAPL GE IBM JNJ MSFT PEP SPX XOM Broker
0 2010-01-04 16.40 214.01 14.66 128.32 60.84 29.76 58.10 1132.99 66.16 XYZ Inc.
1 2010-01-05 15.89 214.38 14.74 126.77 60.14 29.77 58.80 1136.52 66.42 Fidelity
2 2010-01-06 16.72 210.97 14.66 125.94 60.63 29.58 58.21 1137.14 66.99 TD Bank
Out[8]:
(451, 11)


Insert a new column that is the sum of 3 existing columns

In [9]:
## Insert a new column and show the last 3 rows of data
df_stock["total"] = df_stock["AA"] + df_stock["AAPL"] + df_stock["GE"]
df_stock.tail(3)
Out[9]:
DATE AA AAPL GE IBM JNJ MSFT PEP SPX XOM Broker total
448 2011-10-12 10.05 402.19 16.40 186.12 64.33 26.96 62.70 1207.25 77.16 XYZ Inc. 428.64
449 2011-10-13 10.10 408.43 16.22 186.82 64.23 27.18 62.36 1203.66 76.37 Fidelity 434.75
450 2011-10-14 10.26 422.00 16.60 190.53 64.72 27.27 62.24 1224.58 78.11 TD Bank 448.86
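For summing more than a handful of columns, selecting the columns and summing across axis=1 avoids writing out each term. A sketch of building the same "total" column that way:

## Row-wise sum over a list of columns (same result as adding them one by one)
cols = ["AA", "AAPL", "GE"]
df_stock["total"] = df_stock[cols].sum(axis=1)
df_stock.tail(3)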
In [7]:
## Now get the sum / mean / min / max for stock AA
df_stock["AA"].sum(), df_stock["AA"].mean(),df_stock["AA"].min(),df_stock["AA"].max()
Out[7]:
(95433.239999999991, 17.44028508771932, 4.2000000000000002, 43.619999999999997)
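To get these statistics for every numeric column at once, describe() is the usual shortcut; a sketch (output not shown here):

## Summary statistics (count, mean, std, min, quartiles, max) for all numeric columns
df_stock.describe()

## Or pick specific statistics for one column with agg
df_stock["AA"].agg(["sum", "mean", "min", "max"])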


Pandas Time Series Analysis: DatetimeIndex and Resample

- Confirm the DATE column was parsed as a timestamp rather than a string (see the Pandas TimeSeries blog)
- Reload the file with column "DATE" parsed as dates and set as the index
In [10]:
## Check the data type
type(df_stock.DATE[0])

## the type is a timestamp
Out[10]:
pandas._libs.tslib.Timestamp
In [11]:
## Reload the dataframe and set date as index 
df_stock_index = pd.read_excel('data/analysis1/Stock_px.xlsx', parse_dates=["DATE"], index_col="DATE")
df_stock_index.head(3)
Out[11]:
AA AAPL GE IBM JNJ MSFT PEP SPX XOM Broker
DATE
2010-01-04 16.40 214.01 14.66 128.32 60.84 29.76 58.10 1132.99 66.16 XYZ Inc.
2010-01-05 15.89 214.38 14.74 126.77 60.14 29.77 58.80 1136.52 66.42 Fidelity
2010-01-06 16.72 210.97 14.66 125.94 60.63 29.58 58.21 1137.14 66.99 TD Bank
In [12]:
## Now we use the dataframe resample method to create a line graph with multiple series
df_stock_index.resample('D').sum().plot()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x21ef974dc50>
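Since the data is already daily, resampling by 'D' mostly replots the raw series; aggregating to a coarser frequency shows resample doing real work. A sketch resampling a few of the numeric price columns to monthly means (the column list is assumed from the table above):

## Monthly mean price for a few tickers, plotted as one figure
price_cols = ["AAPL", "IBM", "SPX"]
df_stock_index[price_cols].resample('M').mean().plot()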