Data Analysis IV - Merging and Reshaping Data

Sometimes the way that data is stored in files or databases is not the way you need it for a data processing application.
Pandas, along with the Python standard library, provides a high-level, flexible, and high-performance set of core manipulations
and algorithms that let us wrangle data into the right form without much trouble.


Goal: How do we merge and reshape data?
Data Overview:

 - View data in a workbook with hidden tabs
 - Combining and Merging Data sets by index
 - Delete/Remove duplicated data after merging

Data Transformation:

- Detecting and Filtering Outliers 
- Reshaping  
- Discretization and Binning to group data
- Index matching data  

Visualize Data:

- Group by function
- Pivoting data
- Set index  
- Plot data  

Example - Reinsurance Premium Loss Data

- The example merges and reshapes data from different applications to ultimately determine the business's reinsurance coverage.
- This is a simplified and hypothetical example, but it's a great illustration of what Python can do for financial analysis. 
- The example data is generated by a randomization function
   - ie: df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
- The input data is an Excel file consisting of 2 visible tabs, premium and loss, plus 10 hidden tabs

View Excel File with Multiple Tabs

- List of tabs: note that the displayed list includes hidden tabs
- Number of tabs: use the len function to count how many tabs are in the Excel file
- View each tab (see the sheet_name=None sketch after the cell below)
- We can use Python to view, list, and count all the tabs, including hidden ones.

Understand the business: what counts as an abnormal value for this business?

 - If the average premium per policy is between 10K and 200K, any premium outside this range is invalid
 - If the maximum ceded reinsurance recoverable is 500K, then any ceded recoverable greater than 500K is invalid (a validity-check sketch follows this list)
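As a quick illustration, here is a minimal sketch of how these two validity rules could be flagged once the premium tab is loaded (as df_premium below). The Premium column exists in the data; the 'Ceded Recoverable' column name is an assumption for this hypothetical example.

## Hedged sketch: flag values that violate the business rules above
invalid_premium = df_premium[(df_premium['Premium'] < 10000) | (df_premium['Premium'] > 200000)]
print("%s policies have premiums outside the 10K-200K range" % invalid_premium.shape[0])

## 'Ceded Recoverable' is a hypothetical column name - only run this if it exists in your data
if 'Ceded Recoverable' in df_premium.columns:
    invalid_ceded = df_premium[df_premium['Ceded Recoverable'] > 500000]
    print("%s policies have ceded recoverables above 500K" % invalid_ceded.shape[0])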
In [3]:
## Imports used throughout this section (likely already loaded in an earlier cell)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## View the list of tabs, including hidden tabs
df = pd.ExcelFile('data/analysis4/Premium_Loss.xlsx')
print ('df.sheet_names=%s' % df.sheet_names)

## Count how many tabs in the Excel file 
number_tab = len(df.sheet_names)
print("There are %s tab(s) in the file." % (number_tab)) 
df.sheet_names=['premium', 'loss', 'hidden1', 'hidden2', 'hidden3', 'hidden4', 'hidden5', 'hidden6', 'hidden7', 'hidden8', 'hidden9', 'hidden10']
There are 12 tab(s) in the file.
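To actually view each tab, including the hidden ones, passing sheet_name=None to pd.read_excel returns a dict of DataFrames keyed by tab name. A minimal sketch:

## Read every tab (visible and hidden) into a dict of DataFrames
all_tabs = pd.read_excel('data/analysis4/Premium_Loss.xlsx', sheet_name=None)
for name, tab in all_tabs.items():
    print(name, tab.shape)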
In [4]:
## Detect duplicated data from the first tab -"premium"
df_premium = pd.read_excel('data/analysis4/Premium_Loss.xlsx', sheet_name = "premium")
print("There are %s policies" % (df_premium.shape[0]))  

df_prem_dupes=df_premium[df_premium.duplicated()]
df_prem_dupes.head()
print("There are %s duplicate policies" % (df_prem_dupes.shape[0])) 
There are 200 policies
Out[4]:
Policy Number First Name Last Name Policy Effective Date LOB Policy Type Premium Coverage Excess Coverage Excess Policy
2 10962 Taisha Whack 2017-09-19 Dentist OCC 11308 1000000 3900000 91809.0
4 11339 Ricarda Hotaka 2017-12-25 Podiatrist OCC 2651 1000000 3900000 91367.0
8 11106 NaN Sarasota Hospital 2017-08-14 Hospital OCC 446997 1000000 3900000 97303.0
There are 3 duplicate policies
In [5]:
## View data from the second tab -"loss"
df_loss = pd.read_excel('data/analysis4/Premium_Loss.xlsx', sheet_name = "loss") 

## Slicing can achieve the same results as head()
df_loss[0:5]  


Count_Row=df_loss.shape[0] 
print("There are %s claims" % (Count_Row)) 
Out[5]:
Policy Number Claim Number Policy Type First Name Last Name LossDate LAE Loss Premium
0 35033 880247 OCC Adolph Iwan 2017-03-17 84373.0 10000000 176966.0
1 35033 880247 OCC Adolph Iwan 2017-03-17 84373.0 5000000 176966.0
2 47755 880255 OCC Cherryl Aven 2017-11-25 88205.0 4000000 116133.0
3 47755 880255 OCC Cherryl Aven 2017-11-25 88205.0 4000000 116133.0
4 36838 880258 OCC Franklin Maen 2017-03-18 36440.0 54880 65872.0
There are 216 claims


Merge premium and loss tabs

- Use the merge function with the key identifier "Policy Number"
- pandas.merge connects rows in DataFrames based on one or more keys
- By default merge does an 'inner' join: the keys in the result are the intersection of the keys in both tables, and overlapping column names are suffixed with _x and _y (a custom-suffix sketch follows the merge output)
- Save the merged file to Excel
In [6]:
## Merge tabs - Inner Join
df_premium = df.parse("premium")
df_loss = df.parse("loss")
df_merge = pd.merge(df_premium, df_loss, on='Policy Number')
print("There are %s policies with incurred losses - here are the first 2:" % (df_merge.shape[0]))  
df_merge.head(2)

## Save the merged file to Excel
df_merge.to_excel('data/analysis4/MergePremiumLoss.xlsx')
There are 24 policies with incurred losses - here are the first 2:
Out[6]:
Policy Number First Name_x Last Name_x Policy Effective Date LOB Policy Type_x Premium_x Coverage Excess Coverage Excess Policy Claim Number Policy Type_y First Name_y Last Name_y LossDate LAE Loss Premium_y
0 10880 Kurtis Dumm 2017-07-08 Phy CM 232058 1000000 3900000 96289.0 880206 CM Kurtis Dumm 2017-09-24 74793.0 250000 232058.0
1 10948 Florencia Bilyk 2017-06-05 Podiatrist CM 5132 1000000 3900000 99671.0 880208 CM Florencia Bilyk 2018-04-22 38573.0 800000 5132.0
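Since the default _x/_y suffixes can be cryptic, the suffixes argument of pd.merge lets us pick clearer names. A minimal sketch (the _prem/_loss names are my own choice):

## Same inner join, but with explicit suffixes instead of _x/_y
df_merge_named = pd.merge(df_premium, df_loss, on='Policy Number', suffixes=('_prem', '_loss'))
df_merge_named.head(2)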


Combine Tabs - Outer Join

- Other possible join options besides 'inner' are 'left', 'right', and 'outer'.
- The outer join takes the union of the keys, combining the effect of applying both the left and right joins.
- ie: Policy 10962 has no associated loss data, so its missing loss columns are marked as NaN.
In [7]:
## Combined tabs - Outer Join
df_combine = pd.merge(df_premium, df_loss, on='Policy Number', how='outer')
print("Here are the listing of %s policies incurred losses history - here are the first 2:" % (df_combine.shape[0])) 
df_combine.head(5)

## Save the combined file to Excel
df_combine.to_excel('data/analysis4/CombinePremiumLoss.xlsx') 
Here is the incurred loss history for all 392 policy records - the first 5 rows:
Out[7]:
Policy Number First Name_x Last Name_x Policy Effective Date LOB Policy Type_x Premium_x Coverage Excess Coverage Excess Policy Claim Number Policy Type_y First Name_y Last Name_y LossDate LAE Loss Premium_y
0 10880 Kurtis Dumm 2017-07-08 Phy CM 232058.0 1000000.0 3900000.0 96289.0 880206.0 CM Kurtis Dumm 2017-09-24 74793.0 250000.0 232058.0
1 10962 Taisha Whack 2017-09-19 Dentist OCC 11308.0 1000000.0 3900000.0 91809.0 NaN NaN NaN NaN NaT NaN NaN NaN
2 10962 Taisha Whack 2017-09-19 Dentist OCC 11308.0 1000000.0 3900000.0 91809.0 NaN NaN NaN NaN NaT NaN NaN NaN
3 11339 Ricarda Hotaka 2017-12-25 Podiatrist OCC 2651.0 1000000.0 3900000.0 91367.0 NaN NaN NaN NaN NaT NaN NaN NaN
4 11339 Ricarda Hotaka 2017-12-25 Podiatrist OCC 2651.0 1000000.0 3900000.0 91367.0 NaN NaN NaN NaN NaT NaN NaN NaN


Delete/Remove Duplicates After Merging Data

- The DataFrame method duplicated returns a boolean Series
- True means the row is a duplicate
- drop_duplicates returns a DataFrame with the rows flagged True by duplicated removed
     - Our example shows duplicated policies 10962 and 11339; their rows appear as "True" in the output
- By default both methods require all of the columns to match for a row to count as a duplicate (a subset-based sketch follows the drop_duplicates output below)
In [7]:
df_combine.duplicated().head(10)
Out[7]:
0    False
1    False
2     True
3    False
4     True
5    False
6    False
7    False
8    False
9    False
dtype: bool
In [8]:
## View duplicated data after merging
df_dupes=df_combine[df_combine.duplicated()]
print("There are %s duplicate policies - here is the list:" % (df_dupes.shape[0])) 
df_dupes
There are 2 duplicate policies - here is the list:
Out[8]:
Policy Number First Name_x Last Name_x Policy Effective Date LOB Policy Type_x Premium_x Coverage Excess Coverage Excess Policy Claim Number Policy Type_y First Name_y Last Name_y LossDate LAE Loss Premium_y
2 10962 Taisha Whack 2017-09-19 Dentist OCC 11308.0 1000000.0 3900000.0 91809.0 NaN NaN NaN NaN NaT NaN NaN NaN
4 11339 Ricarda Hotaka 2017-12-25 Podiatrist OCC 2651.0 1000000.0 3900000.0 91367.0 NaN NaN NaN NaN NaT NaN NaN NaN
In [9]:
## Drop duplicated data after merging 
df_merge_clean=df_combine.drop_duplicates()
df_merge_clean.head(2)
Out[9]:
Policy Number First Name_x Last Name_x Policy Effective Date LOB Policy Type_x Premium_x Coverage Excess Coverage Excess Policy Claim Number Policy Type_y First Name_y Last Name_y LossDate LAE Loss Premium_y
0 10880 Kurtis Dumm 2017-07-08 Phy CM 232058.0 1000000.0 3900000.0 96289.0 880206.0 CM Kurtis Dumm 2017-09-24 74793.0 250000.0 232058.0
1 10962 Taisha Whack 2017-09-19 Dentist OCC 11308.0 1000000.0 3900000.0 91809.0 NaN NaN NaN NaN NaT NaN NaN NaN
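If only the key column matters, both methods accept a subset argument. A short sketch restricting the duplicate check to Policy Number (keeping the first occurrence of each policy):

## Flag and drop rows that repeat a Policy Number
key_dupes = df_combine[df_combine.duplicated(subset=['Policy Number'])]
df_unique_policies = df_combine.drop_duplicates(subset=['Policy Number'], keep='first')
print("%s rows repeat a Policy Number" % key_dupes.shape[0])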


Data Transformation

- Detect outliers by applying array operations
- Loss should not exceed 1 million since the policy limit is 1M
- Use boolean indexing on the Loss column to find claims whose loss exceeds 1M (a sketch of capping those losses at the limit follows the output below)
In [10]:
df_loss_1M = df_loss[df_loss['Loss'] > 1000000]

print("There are %s claims with losses > 1M - here they are:" % (df_loss_1M.shape[0]))  
df_loss_1M[['Claim Number','Loss']]
There are 8 claims with losses > 1M - here they are:
Out[10]:
Claim Number Loss
0 880247 10000000
1 880247 5000000
2 880255 4000000
6 880016 1500000
8 880205 2500000
13 880489 10000000
16 880754 7000000
23 880979 3000000
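If the intent is to treat these values as errors and cap them at the policy limit (my assumption about the desired treatment), Series.clip makes that a one-liner. A minimal sketch:

## Cap any loss above the 1M policy limit (assumes capping is the desired treatment)
df_loss['Loss_Capped'] = df_loss['Loss'].clip(upper=1000000)
df_loss.loc[df_loss['Loss'] > 1000000, ['Claim Number', 'Loss', 'Loss_Capped']].head()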


Discretization/Binning and Index Matching

- Group data into discrete bins
- Divide losses into five bins: $1 to $50K, $50K to $75K, $75K to $500K, $500K to $1M, and $1M to $20M
- Use the cut function to match each loss to its bin (a sketch that attaches the bin labels back to the claims follows these cells)
In [11]:
## Separate losses into bins
bins = [1, 50000, 75000, 500000, 1000000, 20000000]
cats = pd.cut(df_loss['Loss'], bins)
In [12]:
## Count how many claims fall in each bin - ie. there are 150 claims in the loss range from $75K to $500K
cats.value_counts()
Out[12]:
(75000, 500000]        150
(50000, 75000]          35
(1, 50000]              18
(1000000, 20000000]      8
(500000, 1000000]        5
Name: Loss, dtype: int64
In [13]:
## Show the first 3 claims matched to their loss bins, with left-inclusive intervals (right=False)
pd.cut(df_loss['Loss'], bins, right=False).head(3)
Out[13]:
0    [1000000, 20000000)
1    [1000000, 20000000)
2    [1000000, 20000000)
Name: Loss, dtype: category
Categories (5, interval[int64]): [[1, 50000) < [50000, 75000) < [75000, 500000) < [500000, 1000000) < [1000000, 20000000)]
In [14]:
## Assign group names to the discrete loss bins, ie. the first loss falls in the Clash coverage layer
group_names = ['Layer1', 'Layer2', 'Layer3', 'XOL','Clash']
pd.cut(df_loss['Loss'], bins, labels=group_names).head()
Out[14]:
0     Clash
1     Clash
2     Clash
3    Layer2
4    Layer2
Name: Loss, dtype: category
Categories (5, object): [Layer1 < Layer2 < Layer3 < XOL < Clash]
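To use these layers in later analysis, the labels can be attached back to the claims and aggregated. A short sketch (the 'Layer' column name is my own):

## Attach the layer label to each claim and summarize losses per layer
df_loss['Layer'] = pd.cut(df_loss['Loss'], bins, labels=group_names)
df_loss.groupby('Layer')['Loss'].agg(['count', 'sum'])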


Reshaping Data

- Group claims by line of business (LOB), not dollar amount
- Set Index
- View data by group
- Get the max, min, and mean values for each group (max is shown below; a min/mean sketch follows it)
In [15]:
## Example file is a list of claims data
df_claims = pd.read_excel('data/analysis4/Claim.xlsx')
df_claims[0:2]
Out[15]:
Claim_Number Loss_Date LAE Loss Premium LOB
0 883745 2016-09-24 95846.0 147030 165910.0 Podiatrist
1 884654 2016-05-14 95514.0 121666 72314.0 Podiatrist
In [16]:
## Group claims by Line of Business LOB and set Loss_Date as index
df_claims.set_index('Loss_Date', inplace=True)
df_claims_by_lob = df_claims.groupby('LOB')
In [17]:
## View claim by LOB - Nurse 
df_claims_by_lob.get_group('Nurse').head(2)
Out[17]:
Claim_Number LAE Loss Premium LOB
Loss_Date
2016-03-28 889851 25422.0 161796 208103.0 Nurse
2016-01-01 889591 75493.0 279318 81236.0 Nurse
In [18]:
## Maximum value by LOB
df_claims_by_lob.max()
Out[18]:
Claim_Number LAE Loss Premium
LOB
Dentist 889215 98396.0 290832 240383.0
Hospital 886684 99984.0 298038 245159.0
Nurse 889877 98914.0 279318 246635.0
Physician 881144 266019.0 10000000 248063.0
Podiatrist 884888 98013.0 294633 248669.0
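The list above also mentions min and mean; GroupBy.agg reports all three at once for a single column such as Loss. A quick sketch:

## Min, mean, and max loss per line of business
df_claims_by_lob['Loss'].agg(['min', 'mean', 'max'])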


Summarize Statistics with the describe Function

- Describe generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values
- Transpose the results
- Save the result to Excel called claim_distribution.xlsx
In [19]:
## Transpose the describe output
transposed = df_claims_by_lob.describe().T
transposed.head()
Out[19]:
LOB Dentist Hospital Nurse Physician Podiatrist
Claim_Number count 39.000000 78.000000 21.000000 39.000000 39.000000
mean 887691.615385 884071.576923 889543.666667 880730.512821 884066.743590
std 696.825494 1885.292368 164.153993 391.910763 426.330531
min 886716.000000 881331.000000 889258.000000 880016.000000 883421.000000
25% 887096.500000 882258.750000 889453.000000 880307.500000 883704.000000
In [20]:
## Save the describe result to excel
df_claims_by_lob.describe().T.to_excel('data/analysis4/claim_distribution.xlsx')


Loss distribution by LOB - use groupby plot

- Use groupby function to plot the result
- This is very helpful: we can reuse this code whenever we reload data from other applications on a weekly, monthly, or quarterly basis
- To label the y-axis in millions, I found code on Stack Overflow (referenced below); I plan to explore more axis-labeling options in future blogs
In [21]:
## Format the y-axis ticks in thousands/millions (k/M suffixes)
## ref: https://stackoverflow.com/questions/40566413/matplotlib-pyplot-auto-adjust-unit-of-y-axis
from matplotlib.ticker import FuncFormatter
def y_fmt(y, pos):
    decades = [1e9, 1e6, 1e3, 1e0, 1e-3, 1e-6, 1e-9 ]
    suffix  = ["G", "M", "k", "" , "m" , "u", "n"  ]
    if y == 0:
        return str(0)
    for i, d in enumerate(decades):
        if np.abs(y) >=d:
            val = y/float(d)
            signf = len(str(val).split(".")[1])
            if signf == 0:
                return '{val:d} {suffix}'.format(val=int(val), suffix=suffix[i])
            else:
                if signf == 1:
#                     print(val, signf)
                    if str(val).split(".")[1] == "0":
                       return '{val:d} {suffix}'.format(val=int(round(val)), suffix=suffix[i]) 
                tx = "{"+"val:.{signf}f".format(signf = signf) +"} {suffix}"
                return tx.format(val=val, suffix=suffix[i])

                #return y
    return y

def y_fmt2(x, y):
    return '{:2.2e}'.format(x).replace('e', 'x10^')


## groupby plot draws every LOB on the same Axes and returns one entry per group
ax = df_claims_by_lob['Loss'].plot(legend=True)

ax.iloc[0].yaxis.set_major_formatter(FuncFormatter(y_fmt))
# ax.iloc[0].yaxis.set_major_formatter(FuncFormatter(y_fmt2))
plt.show()
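For reference, if every tick is on the order of millions, a much shorter formatter does a similar job. A minimal sketch of an alternative (it would replace the y_fmt call above):

## Simpler alternative: label every tick directly in millions
millions_fmt = FuncFormatter(lambda y, pos: '%.1fM' % (y / 1e6))
## ax.iloc[0].yaxis.set_major_formatter(millions_fmt)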