When we computed the mean, minimum and maximum of the data in Python, we are employing a technique to get information about the data distribution.
This can be an effective tool when we need to summarize datasets for analysis.

Goal: Apply Python functions to summarize a Dataset
Specifically, I plan to use Python to:

- Sum up columns or rows in python
- Get the maximum, mean, average value in the datasets 
- Filter data to isolate invalid data 
- Transpose columns
- Excel Index-Match function in python (more example at Blog Data Analysis IV)

Example - Dow Jones Index 30 Stocks' Dividends

In [3]:

## Using Dow 30 dividend data sourced from yahoo finance
df = pd.read_excel('data/analysis2/Dow.xlsx')
df.head(3)
df.shape  ## show how many rows and columns in the data

Out[3]:

	Stock Symbol	Company Name	Dividend Yield	Closing Price	Annualized Dividend	Ex-Div Date	Pay Date	50-day moving average	200-day moving average	Dif 50-day
0	WMT	Wal-Mart Stores	0.0242	85.91	2.08	2018-12-06	2019-01-02	82.35	78.6800	-3.56
1	V	Visa	0.0070	119.78	0.84	2018-02-15	2018-03-06	106.44	98.9627	-13.34
2	VZ	Verizon	0.0496	47.58	2.36	2018-04-09	2018-05-01	48.73	47.0300	1.15

Out[3]:

(30, 10)

To see the column headings, call the list function

In [3]:

list(df)

Out[3]:

['Stock Symbol',
 'Company Name',
 'Dividend Yield',
 'Closing Price',
 'Annualized Dividend',
 'Ex-Div Date',
 'Pay Date',
 '50-day moving average',
 '200-day moving average',
 'Dif 50-day']

To get more information about each column, we can use the info function

In [4]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 10 columns):
Stock Symbol              30 non-null object
Company Name              30 non-null object
Dividend Yield            30 non-null float64
Closing Price             30 non-null float64
Annualized Dividend       30 non-null float64
Ex-Div Date               30 non-null datetime64[ns]
Pay Date                  30 non-null datetime64[ns]
50-day moving average     30 non-null float64
200-day moving average    30 non-null float64
Dif 50-day                30 non-null float64
dtypes: datetime64[ns](2), float64(6), object(2)
memory usage: 2.6+ KB

Which Stock Has the Maximum Dividend Yield? - the loc operation is used to define the subset of the data we wish to deal with

In [4]:

## Find the stock with the maximum dividend yield
max_div_stock=df.iloc[df["Dividend Yield"].idxmax()]
max_div_stock
print("The stock with the max dividend yield is %s with yield %s" % (max_div_stock['Company Name'],max_div_stock['Dividend Yield']))

Out[4]:

Stock Symbol                               VZ
Company Name                          Verizon
Dividend Yield                         0.0496
Closing Price                           47.58
Annualized Dividend                      2.36
Ex-Div Date               2018-04-09 00:00:00
Pay Date                  2018-05-01 00:00:00
50-day moving average                   48.73
200-day moving average                  47.03
Dif 50-day                               1.15
Name: 2, dtype: object

The stock with the max dividend yield is Verizon with yield 0.0496

Which Stock Has the Minimum Dividend Yield?

In [5]:

## Find the stock with the minimum dividend yield
min_div_stock=df.iloc[df["Dividend Yield"].idxmin()]
min_div_stock
print("The stock with the minimum dividend yield is %s with yield %s" % (min_div_stock['Company Name'],min_div_stock['Dividend Yield']))

Out[5]:

Stock Symbol                                V
Company Name                             Visa
Dividend Yield                          0.007
Closing Price                          119.78
Annualized Dividend                      0.84
Ex-Div Date               2018-02-15 00:00:00
Pay Date                  2018-03-06 00:00:00
50-day moving average                  106.44
200-day moving average                98.9627
Dif 50-day                             -13.34
Name: 1, dtype: object

The stock with the minimum dividend yield is Visa with yield 0.007

Which Stock Price is Currently Below 50-day Moving Average?

- Calculate the difference between 50-day moving average and the closing price 
- Use min() to show the the stock with the smallest 50-day average
- Filter the difference below zero, this showed a list of stocks below 50-day average

In [9]:

df["Dif 50-day"] = df["50-day moving average"] - df["Closing Price"]  
min_50day= df.iloc[df["Dif 50-day"].idxmin()]
min_50day
print("The stock with the minimum below 50-dday moving average is %s with difference of $%s" % (min_50day['Company Name'],min_50day['Dif 50-day']))
df[df["Dif 50-day"] <=0].head(3)
df[df["Dif 50-day"] <=0].shape

Out[9]:

Stock Symbol                               BA
Company Name                       Boeing Co.
Dividend Yield                         0.0209
Closing Price                          327.36
Annualized Dividend                      6.84
Ex-Div Date               2018-02-08 00:00:00
Pay Date                  2018-03-02 00:00:00
50-day moving average                  255.39
200-day moving average                218.144
Dif 50-day                             -71.97
Name: 26, dtype: object

The stock with the minimum below 50-dday moving average is Boeing Co. with difference of $-71.97

Out[9]:

	Stock Symbol	Company Name	Dividend Yield	Closing Price	Annualized Dividend	Ex-Div Date	Pay Date	50-day moving average	200-day moving average	Dif 50-day
0	WMT	Wal-Mart Stores	0.0242	85.91	2.08	2018-12-06	2019-01-02	82.35	78.6800	-3.56
1	V	Visa	0.0070	119.78	0.84	2018-02-15	2018-03-06	106.44	98.9627	-13.34
3	UNH	UnitedHealth Group	0.0135	221.90	3.00	2018-03-08	2018-03-20	198.85	187.4800	-23.05

Out[9]:

(20, 10)

Insert Column and Assign Value in Python Instead of in Excel
- Test Insert new column after columns 2 with 5% Commission Rate

In [6]:

df.insert(2,"Com Rate %", 5)
df.head(3)

Out[6]:

	Stock Symbol	Company Name	Com Rate %	Dividend Yield	Closing Price	Annualized Dividend	Ex-Div Date	Pay Date	50-day moving average	200-day moving average	Dif 50-day
0	WMT	Wal-Mart Stores	5	0.0242	85.91	2.08	2018-12-06	2019-01-02	82.35	78.6800	-3.56
1	V	Visa	5	0.0070	119.78	0.84	2018-02-15	2018-03-06	106.44	98.9627	-13.34
2	VZ	Verizon	5	0.0496	47.58	2.36	2018-04-09	2018-05-01	48.73	47.0300	1.15

Save the Calculation and the New Column to Excel

In [7]:

df.to_excel('data/analysis2/Dow2.xlsx')
df.tail(3)

Out[7]:

	Stock Symbol	Company Name	Com Rate %	Dividend Yield	Closing Price	Annualized Dividend	Ex-Div Date	Pay Date	50-day moving average	200-day moving average	Dif 50-day
27	AAPL	Apple Inc.	5	0.0146	172.44	2.52	2018-02-09	2018-02-15	156.641	152.6100	-15.799
28	AXP	American Express	5	0.0153	91.60	1.40	2018-04-05	2018-05-10	90.340	84.1304	-1.260
29	MMM	3M	5	0.0254	214.33	5.44	2018-02-15	2018-03-12	216.330	206.6900	2.000

- two new columns created in python is saved in the Excel file

Count Rows and Columns When Dealing With Huge Dataset

In [8]:

Count_Row=df.shape[0] 
Count_Col=df.shape[1] 
print("There are %s rows in this file" % (Count_Row))
print("There are %s columns in this file" % (Count_Col))

There are 30 rows in this file
There are 11 columns in this file

Sum up total for specific columns

In [9]:

sum_row=df[["Dividend Yield","Closing Price"]].sum()
sum_row

Out[9]:

Dividend Yield       0.7648
Closing Price     3513.1300
dtype: float64

Transpose columns - This code is very helpful for viewing the output

In [10]:

## Initially, think of this as similar to a pivot in excel - but more flexible
df_sum=pd.DataFrame(data=sum_row).T
df_sum

Out[10]:

	Dividend Yield	Closing Price
0	0.7648	3513.13

Excel Index-Match function in python
- More example at blog Data Analysis IV

In [11]:

x = pd.DataFrame(data=list(range(0, 10)), columns=["List"])
x
y = pd.DataFrame(data=[(-1, "Small"), (4, "Medium"), (7, "Large")], 
                         columns=["List", "Category"])
y
pd.cut(x.List, y.List.tolist()+[np.inf], labels=y.Category)

Out[11]:

	List
0	0
1	1
2	2
3	3
4	4
5	5
6	6
7	7
8	8
9	9

Out[11]:

	List	Category
0	-1	Small
1	4	Medium
2	7	Large

Out[11]:

0     Small
1     Small
2     Small
3     Small
4     Small
5    Medium
6    Medium
7    Medium
8     Large
9     Large
Name: List, dtype: category
Categories (3, object): [Small < Medium < Large]