Data Analysis II - Summarizing Data

When we computed the mean, minimum and maximum of the data in Python, we are employing a technique to get information about the data distribution.
This can be an effective tool when we need to summarize datasets for analysis.


Goal: Apply Python functions to summarize a Dataset
Specifically, I plan to use Python to:

- Sum up columns or rows in python
- Get the maximum, mean, average value in the datasets 
- Filter data to isolate invalid data 
- Transpose columns
- Excel Index-Match function in python (more example at Blog Data Analysis IV)    


Example - Dow Jones Index 30 Stocks' Dividends

In [3]:
## Using Dow 30 dividend data sourced from yahoo finance
df = pd.read_excel('data/analysis2/Dow.xlsx')
df.head(3)
df.shape  ## show how many rows and columns in the data
Out[3]:
Stock Symbol Company Name Dividend Yield Closing Price Annualized Dividend Ex-Div Date Pay Date 50-day moving average 200-day moving average Dif 50-day
0 WMT Wal-Mart Stores 0.0242 85.91 2.08 2018-12-06 2019-01-02 82.35 78.6800 -3.56
1 V Visa 0.0070 119.78 0.84 2018-02-15 2018-03-06 106.44 98.9627 -13.34
2 VZ Verizon 0.0496 47.58 2.36 2018-04-09 2018-05-01 48.73 47.0300 1.15
Out[3]:
(30, 10)


To see the column headings, call the list function

In [3]:
list(df)
Out[3]:
['Stock Symbol',
 'Company Name',
 'Dividend Yield',
 'Closing Price',
 'Annualized Dividend',
 'Ex-Div Date',
 'Pay Date',
 '50-day moving average',
 '200-day moving average',
 'Dif 50-day']


To get more information about each column, we can use the info function

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 10 columns):
Stock Symbol              30 non-null object
Company Name              30 non-null object
Dividend Yield            30 non-null float64
Closing Price             30 non-null float64
Annualized Dividend       30 non-null float64
Ex-Div Date               30 non-null datetime64[ns]
Pay Date                  30 non-null datetime64[ns]
50-day moving average     30 non-null float64
200-day moving average    30 non-null float64
Dif 50-day                30 non-null float64
dtypes: datetime64[ns](2), float64(6), object(2)
memory usage: 2.6+ KB


Which Stock Has the Maximum Dividend Yield? - the loc operation is used to define the subset of the data we wish to deal with

In [4]:
## Find the stock with the maximum dividend yield
max_div_stock=df.iloc[df["Dividend Yield"].idxmax()]
max_div_stock
print("The stock with the max dividend yield is %s with yield %s" % (max_div_stock['Company Name'],max_div_stock['Dividend Yield']))
Out[4]:
Stock Symbol                               VZ
Company Name                          Verizon
Dividend Yield                         0.0496
Closing Price                           47.58
Annualized Dividend                      2.36
Ex-Div Date               2018-04-09 00:00:00
Pay Date                  2018-05-01 00:00:00
50-day moving average                   48.73
200-day moving average                  47.03
Dif 50-day                               1.15
Name: 2, dtype: object
The stock with the max dividend yield is Verizon with yield 0.0496


Which Stock Has the Minimum Dividend Yield?

In [5]:
## Find the stock with the minimum dividend yield
min_div_stock=df.iloc[df["Dividend Yield"].idxmin()]
min_div_stock
print("The stock with the minimum dividend yield is %s with yield %s" % (min_div_stock['Company Name'],min_div_stock['Dividend Yield']))
Out[5]:
Stock Symbol                                V
Company Name                             Visa
Dividend Yield                          0.007
Closing Price                          119.78
Annualized Dividend                      0.84
Ex-Div Date               2018-02-15 00:00:00
Pay Date                  2018-03-06 00:00:00
50-day moving average                  106.44
200-day moving average                98.9627
Dif 50-day                             -13.34
Name: 1, dtype: object
The stock with the minimum dividend yield is Visa with yield 0.007


Which Stock Price is Currently Below 50-day Moving Average?

- Calculate the difference between 50-day moving average and the closing price 
- Use min() to show the the stock with the smallest 50-day average
- Filter the difference below zero, this showed a list of stocks below 50-day average
In [9]:
df["Dif 50-day"] = df["50-day moving average"] - df["Closing Price"]  
min_50day= df.iloc[df["Dif 50-day"].idxmin()]
min_50day
print("The stock with the minimum below 50-dday moving average is %s with difference of $%s" % (min_50day['Company Name'],min_50day['Dif 50-day']))
df[df["Dif 50-day"] <=0].head(3)
df[df["Dif 50-day"] <=0].shape
Out[9]:
Stock Symbol                               BA
Company Name                       Boeing Co.
Dividend Yield                         0.0209
Closing Price                          327.36
Annualized Dividend                      6.84
Ex-Div Date               2018-02-08 00:00:00
Pay Date                  2018-03-02 00:00:00
50-day moving average                  255.39
200-day moving average                218.144
Dif 50-day                             -71.97
Name: 26, dtype: object
The stock with the minimum below 50-dday moving average is Boeing Co. with difference of $-71.97
Out[9]:
Stock Symbol Company Name Dividend Yield Closing Price Annualized Dividend Ex-Div Date Pay Date 50-day moving average 200-day moving average Dif 50-day
0 WMT Wal-Mart Stores 0.0242 85.91 2.08 2018-12-06 2019-01-02 82.35 78.6800 -3.56
1 V Visa 0.0070 119.78 0.84 2018-02-15 2018-03-06 106.44 98.9627 -13.34
3 UNH UnitedHealth Group 0.0135 221.90 3.00 2018-03-08 2018-03-20 198.85 187.4800 -23.05
Out[9]:
(20, 10)


Insert Column and Assign Value in Python Instead of in Excel
- Test Insert new column after columns 2 with 5% Commission Rate

In [6]:
df.insert(2,"Com Rate %", 5)
df.head(3)
Out[6]:
Stock Symbol Company Name Com Rate % Dividend Yield Closing Price Annualized Dividend Ex-Div Date Pay Date 50-day moving average 200-day moving average Dif 50-day
0 WMT Wal-Mart Stores 5 0.0242 85.91 2.08 2018-12-06 2019-01-02 82.35 78.6800 -3.56
1 V Visa 5 0.0070 119.78 0.84 2018-02-15 2018-03-06 106.44 98.9627 -13.34
2 VZ Verizon 5 0.0496 47.58 2.36 2018-04-09 2018-05-01 48.73 47.0300 1.15


Save the Calculation and the New Column to Excel

In [7]:
df.to_excel('data/analysis2/Dow2.xlsx')
df.tail(3)   
Out[7]:
Stock Symbol Company Name Com Rate % Dividend Yield Closing Price Annualized Dividend Ex-Div Date Pay Date 50-day moving average 200-day moving average Dif 50-day
27 AAPL Apple Inc. 5 0.0146 172.44 2.52 2018-02-09 2018-02-15 156.641 152.6100 -15.799
28 AXP American Express 5 0.0153 91.60 1.40 2018-04-05 2018-05-10 90.340 84.1304 -1.260
29 MMM 3M 5 0.0254 214.33 5.44 2018-02-15 2018-03-12 216.330 206.6900 2.000

- two new columns created in python is saved in the Excel file


Count Rows and Columns When Dealing With Huge Dataset

In [8]:
Count_Row=df.shape[0] 
Count_Col=df.shape[1] 
print("There are %s rows in this file" % (Count_Row))
print("There are %s columns in this file" % (Count_Col))
There are 30 rows in this file
There are 11 columns in this file


Sum up total for specific columns

In [9]:
sum_row=df[["Dividend Yield","Closing Price"]].sum()
sum_row
Out[9]:
Dividend Yield       0.7648
Closing Price     3513.1300
dtype: float64


Transpose columns - This code is very helpful for viewing the output

In [10]:
## Initially, think of this as similar to a pivot in excel - but more flexible
df_sum=pd.DataFrame(data=sum_row).T
df_sum 
Out[10]:
Dividend Yield Closing Price
0 0.7648 3513.13


Excel Index-Match function in python
- More example at blog Data Analysis IV

In [11]:
x = pd.DataFrame(data=list(range(0, 10)), columns=["List"])
x
y = pd.DataFrame(data=[(-1, "Small"), (4, "Medium"), (7, "Large")], 
                         columns=["List", "Category"])
y
pd.cut(x.List, y.List.tolist()+[np.inf], labels=y.Category)
Out[11]:
List
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
Out[11]:
List Category
0 -1 Small
1 4 Medium
2 7 Large
Out[11]:
0     Small
1     Small
2     Small
3     Small
4     Small
5    Medium
6    Medium
7    Medium
8     Large
9     Large
Name: List, dtype: category
Categories (3, object): [Small < Medium < Large]