Data Visualization I - Python
Data visualizations and static or interactive visualizations are one of the data analysis tools to helping identify outliers or data transformations and ideals for financial models.
The maximum number of data series per chart in Excel is 255, so using Python to plot a huge dataset is very useful.
Goal: Use Python to read Excel file and plot visualization chart for financial analysis
- Example file - Visualize popluation and life expectancy comparison between US and Japan from 1960 to 2017
Visualize the data
- Compare US and Japan life Expectancy from 1960 to 2017
- After initial data analysis, sometime we need to reshape the data for further analaysis
- We can reshape the data format to set index for visualization
- The maximum number of data series per chart in Excel is 255, this data has 264 columns as data series
First Example - Life Expectancy for 265 Countries from 1960 to 2017
- Select specific columns for plot
## Read the file, but don't set index since we will define 'Year' as x-axix
df = pd.read_excel('data/Plot1/LifeChart.xlsx')
year = df['Year']
country = df.loc[:,'Aruba':'Zimbabwe']
print (country[0:2])
Count_Row=df.shape[0]
Count_Col=df.shape[1]
print("There are %s years in this file" % (Count_Row))
print("There are %s countries in this file" % (Count_Col))
Compare Japan and US life Expectancy from 1960 to 2017
- Save the chart as png file, then use the chart for analysis or presentation
- It's effective to chart the dataset when dealing with huge dataset since Excel limit 255 data series
plt.figure(1)
plt.plot(year, country['Japan'], 'y-')
plt.plot(year, country['United_States'], 'b.')
plt.legend(['Japan Life Expectancy','US Life Expectancy'])
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.savefig('data/Plot1/JapanLifeExpectancy.png')
Second Example - Population Growth for US and Japan from 1960 to 2017
- Plot population in million
- How to fix scientific notation le8
- Plot population growth rate to visualize the comparsion
Read the file
- To see the column headings, call the list function and also get a glimpse of the data
data = pd.read_excel('data/Plot1/USALife.xlsx')
data[data.Country == 'USA']
us = data[data.Country == 'USA']
japan = data[data.Country == 'Japan']
data.head(2) # get the glimpse of the data
list(data) # see the columns headings
Plot shows scientific notation le8 on the top left corner
- 1e8 is standard scientific notion, and it indicates an overall scale factor for the y-axis is too big
plt.plot(us.Year, us.Population, 'b')
plt.plot(japan.Year, japan.Population, 'r')
plt.legend(['USA Population', 'Japan Population'])
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()
Plot a dot line graph with y-axis in million
- million is with 6 zero
plt.plot(us.Year, us.Population / 10**6, 'b.')
plt.plot(japan.Year, japan.Population / 10**6, 'r.')
plt.legend(['USA Population', 'Japan Population'])
plt.xlabel('Year')
plt.ylabel('Population (in million)')
plt.show()
Plot population growth rate - visiualize the growth
- Japan population growth rate is starting to be lower than USA from mid-1980
- Take the first year 1960 as first year, then divide to the following year for growth rate
- Based on the chart, we can see USA popluation grow faster pace than Japan from mid 1990s
plt.plot(us.Year, us.Population / us.Population.iloc[0] * 100, 'b')
plt.plot(japan.Year, japan.Population / japan.Population.iloc[0] * 100, 'r')
plt.legend(['USA Population Growth Rate', 'Japan Population Growth Rate'])
plt.xlabel('Year')
plt.ylabel('Population growth (first year = 100)')
plt.show()