Pareto Analysis (80/20 rule) and Python

Author: Iqbal Hossain

Pareto analysis, also known as the 80/20 rule, is a statistical decision-making method used to prioritise the tasks that have the most significant effect. It is based on the idea that 80 percent of the benefit can come from doing 20 percent of the work. In this short article I use plain Python code and libraries to generate a Pareto analysis chart from very simple test data. I have tried to keep it easy to understand and to show the code in detail so that you can reuse it in your own project.

Suppose we are analysing health-care data and trying to figure out which clinical errors occur after a doctor's visit and the issue of a prescription. Our survey returned the following data.

| Errors | Cases found |
|---|---|
| Wrong prescription | 2 |
| Over dose intake | 20 |
| Low dose intake | 10 |
| Repeated doses intake | 5 |
| Medicine wrong time intake | 42 |
| Patient unawareness | 21 |
| Intake forgotten | 3 |
| Wrong medicine intake | 1 |
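
Before any plotting, the survey counts can be ranked and turned into cumulative percentages by hand. The snippet below is only a minimal sketch using the standard library; the names `survey`, `ranked` and `cumulative_pct` are my own and do not appear in the chart script.

```python
from itertools import accumulate

# Survey counts copied from the table above.
survey = {
    "Wrong prescription": 2,
    "Over dose intake": 20,
    "Low dose intake": 10,
    "Repeated doses intake": 5,
    "Medicine wrong time intake": 42,
    "Patient unawareness": 21,
    "Intake forgotten": 3,
    "Wrong medicine intake": 1,
}

# Sort from most to least frequent, as a Pareto chart does.
ranked = sorted(survey.items(), key=lambda item: item[1], reverse=True)
total = sum(survey.values())

# Running total of cases, converted to a percentage of all cases.
running = list(accumulate(count for _, count in ranked))
cumulative_pct = [100 * r / total for r in running]

# Print one row per error: name, count, cumulative share.
for (name, count), pct in zip(ranked, cumulative_pct):
    print(f"{name:<28}{count:>4}{pct:>8.1f}%")
```

With these counts, the three most frequent errors together account for just under 80 percent of the 104 cases.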

Whether the case counts are high or low, it may not be wise to judge by eye which errors are more significant and which are less. We should not prioritise on raw frequency alone. In such a case we can use the Pareto 80/20 rule, which helps to identify which errors are critical and which are trivial. The following code is written in plain Python so that it is self-explanatory to the reader.


    import matplotlib as mpl
    mpl.use('TkAgg')
    from matplotlib import pyplot as plt
    import pandas as pd
    import numpy as np

    error_cat = ["Wrong\nprescription", "Over\ndose\nintake", "Low\ndose\nintake", "Repeated\ndoses\nintake",
                    "Wrong\ntime\nintake", "Patient\nunawareness", "Intake\nforgotten", "Wrong\nmedicine\nintake",
                    "Medicine\nunavailable"]
    error_freq = [2, 20, 10, 5, 42, 21, 30, 20, 1]

    # make dataframe
    data = list(zip(error_cat, error_freq))
    df = pd.DataFrame(data, columns=['category', 'frequency'])

    # sort by frequency and re-index the rows
    df = df.sort_values(['frequency'], ascending=False).reset_index(drop=True)

    sum_for_frequency = df["frequency"].sum()

    # calculate relative frequency and cumulative frequency
    df["relative_frequency"] = round((df["frequency"] / sum_for_frequency) * 100, 2)
    df["cumulative_frequency"] = np.cumsum(df["relative_frequency"])

    # prepare plot
    fig, axes1 = plt.subplots()
    plt.xticks(rotation=0, fontsize=8)

    # prepare axes
    axes1.set_ylim(0, df["frequency"].max() + 10)
    axes1.set_ylabel("frequency", color='black')
    axes1.spines['left'].set_visible(True)
    axes1.spines['top'].set_visible(False)
    axes1.yaxis.set_visible(True)
    axes1_bars = axes1.bar(df['category'], df['frequency'], color="#818380", zorder=10)

    df_20 = df.loc[df["cumulative_frequency"] < 80]
    critical_few = df_20.shape[0]
    trivial_many = df.shape[0] - critical_few

    # mark critical bars and show frequency on top of the bar
    for i in range(critical_few):
        axes1_bars[i].set_color("#173F5F")
        plt.text(axes1_bars[i].get_x() + axes1_bars[i].get_width() / 3, axes1_bars[i].get_height() + 1,
                    axes1_bars[i].get_height(), fontsize=9, color='black', zorder=20)

    # share x-axis and prepare axes2
    axes2 = axes1.twinx()
    axes2.set_ylim(0, 105)
    axes2.set_ylabel("cumulative frequency in %", color="gray")
    axes2.spines['left'].set_visible(False)
    axes2.spines['top'].set_visible(False)
    axes2.tick_params(axis='y', colors='gray')
    axes2.plot(df['category'], df['cumulative_frequency'], "--o", color='#767775', linewidth=.8)

    # share y-axis and prepare axes3 for 80/20 marks
    ax3 = axes2.twiny()
    ax3.spines['left'].set_visible(False)
    ax3.spines['top'].set_visible(True)
    ax3.spines['right'].set_visible(False)
    ax3.yaxis.set_visible(False)
    ax3.xaxis.set_visible(False)
    ax3.set_xlim([0, df.shape[0]])

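    # draw intersection lines at the 80% mark and label critical vs. trivial
    # axhline/axvline take axes fractions (0-1) for where the line starts and ends,
    # so scale by the number of categories and by the y-limit of 105
    ax3.axhline(80, 1, critical_few / df.shape[0], color="red", linestyle="--", linewidth=.5)
    ax3.axvline(critical_few, 80 / 105, 0, color="red", linestyle="--", linewidth=.5)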
    ax3.text(critical_few - 2, 100, "critical (20%)", fontsize=10, color="red")
    ax3.text(critical_few + 2, 60, "trivial (80%)", fontsize=10, color="#818380")

    # set title and show plot
    title = plt.title("critical vs trivial clinical errors : n = %s" % df["frequency"].sum(), loc="center", fontsize=10)
    plt.setp(title, color="black")
    fig.tight_layout()
    plt.show()
    

Figure: Prioritise critical errors of clinical trial
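
The split into critical and trivial errors can also be checked without drawing anything. The function below is a hedged sketch, not part of the script above: `split_critical_trivial` and its `threshold` parameter are hypothetical names, and it works on unrounded percentages, so in borderline cases it could differ slightly from the script's `cumulative_frequency < 80` filter, which sums rounded values.

```python
def split_critical_trivial(categories, frequencies, threshold=80.0):
    """Return (critical, trivial) lists of category names.

    Categories are taken in descending order of frequency. A category is
    'critical' while the cumulative share of cases stays below `threshold`
    percent; everything after that point is 'trivial'.
    """
    ranked = sorted(zip(categories, frequencies), key=lambda p: p[1], reverse=True)
    total = sum(frequencies)
    critical, trivial, running = [], [], 0
    for name, freq in ranked:
        running += freq
        if 100 * running / total < threshold:
            critical.append(name)
        else:
            trivial.append(name)
    return critical, trivial


# Same categories and counts as in the chart script, without the line breaks.
error_cat = ["Wrong prescription", "Over dose intake", "Low dose intake",
             "Repeated doses intake", "Wrong time intake", "Patient unawareness",
             "Intake forgotten", "Wrong medicine intake", "Medicine unavailable"]
error_freq = [2, 20, 10, 5, 42, 21, 30, 20, 1]

critical, trivial = split_critical_trivial(error_cat, error_freq)
print("critical:", critical)
print("trivial:", trivial)
```

For the script's data this gives four critical categories and five trivial ones, which corresponds to `critical_few` and `trivial_many` in the chart code.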



You can modify the code according to your needs. You are also most welcome to improve it and send your contribution to my email address so that I can publish it here.