## Pareto Analysis (80/20 rule) and Python

Author: Iqbal Hossain

Pareto analysis, also known as the 80/20 rule, is a statistical decision-making technique for prioritising the tasks that have the most significant effect. It is based on the idea that roughly 80 percent of the benefit can come from 20 percent of the work. In this short write-up I use plain Python and a few libraries to generate a Pareto chart from very simple test data. I have tried to keep it easy to understand and to show the code in full so you can re-use it in your own project.

Suppose we are analysing health-care data and trying to identify the clinical errors that occur after a doctor's visit and the issuing of a prescription. Our survey returned the following data.

| Error | Cases found |
| --- | --- |
| Wrong prescription | 2 |
| Over dose intake | 20 |
| Low dose intake | 10 |
| Repeated doses intake | 5 |
| Medicine wrong time intake | 42 |
| Patient unawareness | 21 |
| Intake forgotten | 3 |
| Wrong medicine intake | 1 |
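
To make the 80/20 idea concrete, the survey counts above can be ranked and accumulated with nothing but the standard library. This is only a minimal sketch: the `survey` dictionary and the `cumulative_share` helper are names I made up for illustration, and the 80 percent cut-off is the usual rule of thumb rather than a fixed law.

```python
# Survey counts copied from the table above.
survey = {
    "Wrong prescription": 2,
    "Over dose intake": 20,
    "Low dose intake": 10,
    "Repeated doses intake": 5,
    "Medicine wrong time intake": 42,
    "Patient unawareness": 21,
    "Intake forgotten": 3,
    "Wrong medicine intake": 1,
}


def cumulative_share(counts):
    """Rank categories by count and attach the running share of the total (in %)."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    running = 0
    rows = []
    for category, count in ranked:
        running += count
        rows.append((category, count, round(running / total * 100, 2)))
    return rows


rows = cumulative_share(survey)
for category, count, cum_pct in rows:
    print(f"{category:<28}{count:>4}{cum_pct:>8.2f}%")

# Categories whose running share is still below 80% (hypothetical cut-off).
critical = [category for category, _, cum_pct in rows if cum_pct < 80]
```

With these counts the first three ranked categories stay just under the 80 percent mark, so they are the natural candidates to look at first.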

Whether the case counts are high or low, it may not be wise to judge from the raw numbers alone which errors are more significant and which are less. We should not prioritise based on frequency alone. In such a case we can use the Pareto 80/20 rule, which may help us identify which errors are critical and which are trivial. The following code is written in plain Python so that it is self-explanatory to the reader.

```python
import matplotlib as mpl
mpl.use('TkAgg')
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np

error_cat = ["Wrong\nprescription", "Over\ndose\nintake", "Low\ndose\nintake", "Repeated\ndoses\nintake",
"Wrong\ntime\nintake", "Patient\nunawareness", "Intake\nforgotten", "Wrong\nmedicine\nintake",
"Medicine\nunavailable"]
error_freq = [2, 20, 10, 5, 42, 21, 30, 20, 1]

# make dataframe
data = list(zip(error_cat, error_freq))
df = pd.DataFrame(data, columns=['category', 'frequency'])

# sort by frequency and re-index the rows
df = df.sort_values(['frequency'], ascending=False).reset_index(drop=True)

sum_for_frequency = df["frequency"].sum()

# calculate relative frequency and cumulative frequency
df["relative_frequency"] = round((df["frequency"] / sum_for_frequency) * 100, 2)
df["cumulative_frequency"] = np.cumsum(df["relative_frequency"])

# prepare plot
fig, axes1 = plt.subplots()
plt.xticks(rotation=0, fontsize=8)

# prepare axes
axes1.set_ylim(0, df["frequency"].max() + 10)
axes1.set_ylabel("frequency", color='black')
axes1.spines['left'].set_visible(True)
axes1.spines['top'].set_visible(False)
axes1.yaxis.set_visible(True)
axes1_bars = axes1.bar(df['category'], df['frequency'], color="#818380", zorder=10)

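# rows whose cumulative share is below 80% form the "critical few"
df_20 = df.loc[df["cumulative_frequency"] < 80]
critical_few = df_20.shape[0]  # shape is (rows, columns); we need the row count
trivial_many = df.shape[0] - critical_few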

# mark critical bars and show frequency on top of the bar
for i in range(critical_few):
    axes1_bars[i].set_color("#173F5F")
    plt.text(axes1_bars[i].get_x() + axes1_bars[i].get_width() / 3, axes1_bars[i].get_height() + 1,
             str(axes1_bars[i].get_height()), fontsize=9, color='black', zorder=20)

# share x-axis and prepare axes2
axes2 = axes1.twinx()
axes2.set_ylim(0, 105)
axes2.set_ylabel("cumulative frequency in %", color="gray")
axes2.spines['left'].set_visible(False)
axes2.spines['top'].set_visible(False)
axes2.tick_params(axis='y', colors='gray')
axes2.plot(df['category'], df['cumulative_frequency'], "--o", color='#767775', linewidth=.8)

# share y-axis and prepare axes3 for 80/20 marks
ax3 = axes2.twiny()
ax3.spines['left'].set_visible(False)
ax3.spines['top'].set_visible(True)
ax3.spines['right'].set_visible(False)
ax3.yaxis.set_visible(False)
ax3.xaxis.set_visible(False)
ax3.set_xlim([0, df.shape[0]])

# draw intersection lines and mark critical and trivial
# (xmin/xmax and ymin/ymax are fractions of the axes, not data values)
ax3.axhline(80, critical_few / df.shape[0], 1, color="red", linestyle="--", linewidth=.5)
ax3.axvline(critical_few, 0, 80 / 105, color="red", linestyle="--", linewidth=.5)
ax3.text(critical_few - 2, 100, "critical (20%)", fontsize=10, color="red")
ax3.text(critical_few + 2, 60, "trivial (80%)", fontsize=10, color="#818380")

# set title and show plot
title = plt.title("critical vs trivial clinical errors : n = %s" % df["frequency"].sum(), loc="center", fontsize=10)
plt.setp(title, color="black")
fig.tight_layout()
plt.show()
```
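
If you only need the numbers and not the chart, the data-preparation steps of the script can be wrapped in a small function. The sketch below is one possible way to do that, not a fixed recipe: `pareto_frame`, its `threshold` parameter and the `is_critical` column are hypothetical names I chose, and the category labels are written without line breaks for readability. It applies the same rule as the script (a category is critical while the cumulative share is below the threshold), but without rounding the relative frequencies first.

```python
import pandas as pd


def pareto_frame(categories, frequencies, threshold=80.0):
    """Return a frame sorted by frequency with cumulative % and a critical flag."""
    df = pd.DataFrame({"category": categories, "frequency": frequencies})
    # A stable sort keeps the result reproducible when two categories tie.
    df = df.sort_values("frequency", ascending=False, kind="stable").reset_index(drop=True)
    df["relative_frequency"] = df["frequency"] / df["frequency"].sum() * 100
    df["cumulative_frequency"] = df["relative_frequency"].cumsum()
    # Critical while the running total is still below the threshold.
    df["is_critical"] = df["cumulative_frequency"] < threshold
    return df


# Same test data as in the script above.
error_cat = ["Wrong prescription", "Over dose intake", "Low dose intake",
             "Repeated doses intake", "Wrong time intake", "Patient unawareness",
             "Intake forgotten", "Wrong medicine intake", "Medicine unavailable"]
error_freq = [2, 20, 10, 5, 42, 21, 30, 20, 1]

table = pareto_frame(error_cat, error_freq)
print(table.round(2))
```

Note that two categories in this test data share a frequency of 20 and sit right at the cut-off, so which of them is flagged as critical depends only on the sort order. With real data it is worth checking such ties by hand.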

## Prioritise critical errors of clinical trial

You can modify the code to suit your needs. You are also welcome to improve it and send me your contributions by email so that I can add them here.