Data Analysis II - Summarizing Data
When we computed the mean, minimum and maximum of the data in Python, we are employing a technique to get information about the data distribution.
This can be an effective tool when we need to summarize datasets for analysis.
Goal: Apply Python functions to summarize a Dataset
Specifically, I plan to use Python to:
- Sum up columns or rows in python
- Get the maximum, mean, average value in the datasets
- Filter data to isolate invalid data
- Transpose columns
- Excel Index-Match function in python (more example at Blog Data Analysis IV)
Example - Dow Jones Index 30 Stocks' Dividends
## Using Dow 30 dividend data sourced from yahoo finance
df = pd.read_excel('data/analysis2/Dow.xlsx')
df.head(3)
df.shape ## show how many rows and columns in the data
To see the column headings, call the list function
list(df)
To get more information about each column, we can use the info function
df.info()
Which Stock Has the Maximum Dividend Yield? - the loc operation is used to define the subset of the data we wish to deal with
## Find the stock with the maximum dividend yield
max_div_stock=df.iloc[df["Dividend Yield"].idxmax()]
max_div_stock
print("The stock with the max dividend yield is %s with yield %s" % (max_div_stock['Company Name'],max_div_stock['Dividend Yield']))
Which Stock Has the Minimum Dividend Yield?
## Find the stock with the minimum dividend yield
min_div_stock=df.iloc[df["Dividend Yield"].idxmin()]
min_div_stock
print("The stock with the minimum dividend yield is %s with yield %s" % (min_div_stock['Company Name'],min_div_stock['Dividend Yield']))
Which Stock Price is Currently Below 50-day Moving Average?
- Calculate the difference between 50-day moving average and the closing price
- Use min() to show the the stock with the smallest 50-day average
- Filter the difference below zero, this showed a list of stocks below 50-day average
df["Dif 50-day"] = df["50-day moving average"] - df["Closing Price"]
min_50day= df.iloc[df["Dif 50-day"].idxmin()]
min_50day
print("The stock with the minimum below 50-dday moving average is %s with difference of $%s" % (min_50day['Company Name'],min_50day['Dif 50-day']))
df[df["Dif 50-day"] <=0].head(3)
df[df["Dif 50-day"] <=0].shape
Insert Column and Assign Value in Python Instead of in Excel
- Test Insert new column after columns 2 with 5% Commission Rate
df.insert(2,"Com Rate %", 5)
df.head(3)
Save the Calculation and the New Column to Excel
df.to_excel('data/analysis2/Dow2.xlsx')
df.tail(3)
- two new columns created in python is saved in the Excel file
Count Rows and Columns When Dealing With Huge Dataset
Count_Row=df.shape[0]
Count_Col=df.shape[1]
print("There are %s rows in this file" % (Count_Row))
print("There are %s columns in this file" % (Count_Col))
Sum up total for specific columns
sum_row=df[["Dividend Yield","Closing Price"]].sum()
sum_row
Transpose columns - This code is very helpful for viewing the output
## Initially, think of this as similar to a pivot in excel - but more flexible
df_sum=pd.DataFrame(data=sum_row).T
df_sum
Excel Index-Match function in python
- More example at blog Data Analysis IV
x = pd.DataFrame(data=list(range(0, 10)), columns=["List"])
x
y = pd.DataFrame(data=[(-1, "Small"), (4, "Medium"), (7, "Large")],
columns=["List", "Category"])
y
pd.cut(x.List, y.List.tolist()+[np.inf], labels=y.Category)