Funnel Analysis - AARRR Model Case Study

   Funnel analysis is a process-oriented data analysis method that shows how users convert at each stage of a flow. Funnel models are widely used in user behavior analytics products. They can evaluate the conversion of an entire flow or of each individual step, as well as the effect of promotional campaigns. They can also be combined with other analysis models for a deeper study of user behavior, in order to find the reasons users churn and, from there, to increase user numbers, activity, and retention.


   The two most commonly used, complementary metrics in funnel analysis are conversion rate and churn rate. For example, suppose 100 people visit an e-commerce website and 27 of them pay successfully, passing through 5 steps along the way. The conversion rate from the first step to the second is 88% and the churn rate is 12%; the conversion rate from the second step to the third is 32% and the churn rate is 68%; and so on. The conversion rate for the entire process is 27% and the churn rate is 73%. This is the classic funnel analysis model. Each layer of the funnel is a funnel event, and the core metric is the conversion rate.


The three elements of the funnel:


Time: the conversion cycle of the funnel, that is, the total time needed to complete every layer of the funnel.
Node: each layer of the funnel is a node.
Traffic: the population (number of people) entering the funnel.


For time: generally speaking, the shorter a funnel's conversion cycle, the better.


For nodes: the core metric is the conversion rate, calculated as: conversion rate = number of users who completed this layer's event / number of users who completed the previous layer's event


For traffic: different groups of people usually perform differently in the same funnel. For example, in a funnel for a high-end electronic product, the conversion rates of younger and older users are likely to differ.
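The node formula and the point about traffic can be sketched in a few lines of Python. The stage counts and segment names below are made up for illustration; they are not data from this case study.

```python
def step_conversion_rates(counts):
    """Return the conversion rate of each layer relative to the layer above it."""
    return [curr / prev for prev, curr in zip(counts, counts[1:])]


# Hypothetical stage counts for two user segments going through the same funnel.
funnel_by_segment = {
    "younger users": [1000, 600, 150],
    "older users": [1000, 400, 40],
}

for segment, counts in funnel_by_segment.items():
    rates = step_conversion_rates(counts)
    overall = counts[-1] / counts[0]  # conversion rate of the whole funnel
    churn = 1 - overall               # churn rate is the complement
    formatted = ", ".join(f"{r:.0%}" for r in rates)
    print(f"{segment}: steps [{formatted}], overall {overall:.0%}, churn {churn:.0%}")
```

The overall conversion rate is also the product of the step rates, which is why one weak layer drags down the whole funnel.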


   The AARRR model is one kind of funnel model. It tells us which metrics to monitor and analyze. The name abbreviates five important stages of the user life cycle: Acquisition, Activation, Retention, Revenue, and Referral (users spreading the product themselves).






   Take an online e-book app as an example. From opening the app to making a purchase, a new user goes through a path with the following four main steps:


1. Open the APP and enter the home page
2. Enter the book page
3. Start trial reading
4. Buy books


   At each step, only some users move on to the next one. Users who reach the next step are said to convert, and users who do not are said to churn. This series of steps resembles a stack of funnels, so analyzing conversion layer by layer is called funnel analysis. With it, we can find which layers have particularly low conversion and deserve to be improved first.
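As a rough sketch of "finding the layer that deserves priority", the snippet below picks the transition with the lowest conversion rate. The step names follow the four steps listed above; the counts are hypothetical and the helper `weakest_step` is introduced here only for illustration.

```python
# Hypothetical counts for the four steps (not the platform's real data).
steps = ["Open app", "Enter book page", "Start trial reading", "Buy book"]
counts = [5000, 1200, 900, 30]


def weakest_step(steps, counts):
    """Return (from_step, to_step, rate) for the transition with the lowest conversion."""
    transitions = [
        (steps[i], steps[i + 1], counts[i + 1] / counts[i])
        for i in range(len(counts) - 1)
    ]
    return min(transitions, key=lambda t: t[2])


src, dst, rate = weakest_step(steps, counts)
print(f"Lowest conversion: {src} -> {dst} ({rate:.1%})")
```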


   The user data is extracted from the e-book platform's backend, and each data set is stored under a matching name: active_uids, enterbook_uids, trial_uids, and paid_uids.






   First, print the first 10 active user IDs:






import numpy as np

active_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\active_uids.csv')
print(active_uids[:10])
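One detail worth knowing: `np.genfromtxt` parses values as floats unless told otherwise, so the IDs print with a decimal point. A minimal sketch, using an in-memory stand-in for the CSV file (the IDs below are made up), shows the difference when `dtype` is passed explicitly.

```python
import io
import numpy as np

# Stand-in for a file with one user ID per line; these IDs are made up.
sample = "125777\n200431\n318902\n"

as_float = np.genfromtxt(io.StringIO(sample))                # default dtype is float
as_int = np.genfromtxt(io.StringIO(sample), dtype=np.int64)  # parse as integers instead

print(as_float.dtype, as_float[:2])
print(as_int.dtype, as_int[:2])
```

Either form works for counting and set operations; integer IDs are simply easier to read in the output.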


   Next, count the user IDs at each of the four stages:






import numpy as np

active_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\active_uids.csv')
enterbook_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\enterbook_uids.csv')
trial_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\trial_uids.csv')
paid_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\paid_uids.csv')

num_active_uids = len(active_uids)
num_enterbook_uids = len(enterbook_uids)
num_trial_uids = len(trial_uids)
num_paid_uids = len(paid_uids)

print("Active users: %d" % num_active_uids)
print("Users who opened a book page: %d" % num_enterbook_uids)
print("Trial-reading users: %d" % num_trial_uids)
print("Purchasing users: %d" % num_paid_uids)


   With these counts, we can draw a simple bar chart:






import numpy as np
import matplotlib.pyplot as plt

active_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\active_uids.csv')
enterbook_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\enterbook_uids.csv')
trial_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\trial_uids.csv')
paid_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\paid_uids.csv')

num_active_uids = len(active_uids)
num_enterbook_uids = len(enterbook_uids)
num_trial_uids = len(trial_uids)
num_paid_uids = len(paid_uids)

x = ['Active','Enter Book','Trial','Paid']
data = [num_active_uids,num_enterbook_uids,num_trial_uids,num_paid_uids]
plt.bar(x,data)
plt.show()


   The chart clearly shows that the conversion rate from trial reading to purchase is low. In fact, there are two different purchase paths, and we need to analyze them separately.


   Users can buy either after a trial read or directly from the book page without one. To tell these paths apart, we cannot just look at the counts; we have to work with the raw user IDs. We start with users who bought after a trial read: their IDs must appear in both the trial list and the purchase list. In other words, the IDs that trial_uids and paid_uids have in common are the users who purchased after trial reading.


   How can we quickly find the overlap between two lists? Python's built-in set type removes duplicates and supports intersection. For two sets s and t, s.intersection(t) returns their common elements. Intersection is symmetric, so t.intersection(s) gives the same result. The shortest way to write it is s & t (or t & s).
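The set operations described above can be tried on a tiny example first. The IDs here are made up purely to show the behavior.

```python
# Made-up IDs, just to show the set operations.
trial_ids = [101, 102, 103, 103, 104]   # note the duplicate 103
paid_ids = [103, 104, 105]

trial_set = set(trial_ids)              # duplicates collapse to a single entry
paid_set = set(paid_ids)

both = trial_set & paid_set             # same as trial_set.intersection(paid_set)
print(sorted(both))
```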






import numpy as np
# import matplotlib.pyplot as plt
# active_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\active_uids.csv')
# enterbook_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\enterbook_uids.csv')

trial_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\trial_uids.csv')
paid_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\paid_uids.csv')

trial_uids_set = set(trial_uids)
paid_uids_set = set(paid_uids)
paid_with_trial_uids = trial_uids_set & paid_uids_set
num_paid_with_trial_uids = len(paid_with_trial_uids)


print("%d users bought after a trial read" % num_paid_with_trial_uids)


   The output shows that 22 users tried reading before buying. By contrast, the ID of a user who bought directly without a trial read must appear in paid_uids and must not appear in trial_uids, so a set difference gives the result.






import numpy as np
#import matplotlib.pyplot as plt

# active_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\active_uids.csv')
# enterbook_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\enterbook_uids.csv')

trial_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\trial_uids.csv')
paid_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\paid_uids.csv')

trial_uids_set = set(trial_uids)
paid_uids_set = set(paid_uids)
paid_with_trial_uids = trial_uids_set & paid_uids_set

paid_without_trial_uids = paid_uids_set - paid_with_trial_uids
num_paid_without_trial_uids = len(paid_without_trial_uids)

print("%d users bought without a trial read" % num_paid_without_trial_uids)


   The output shows that 114 users bought without a trial read.
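A quick sanity check for this kind of split is that the two groups never overlap and together make up all purchasers. The sketch below uses made-up IDs rather than the platform data.

```python
# Made-up IDs for illustration.
trial_set = {101, 102, 103, 104}
paid_set = {103, 104, 105, 106, 107}

paid_with_trial = trial_set & paid_set
paid_without_trial = paid_set - trial_set   # equivalent to paid_set - paid_with_trial

# The two groups should not overlap and should add up to all purchasers.
assert paid_with_trial.isdisjoint(paid_without_trial)
assert paid_with_trial | paid_without_trial == paid_set

print(len(paid_with_trial), len(paid_without_trial), len(paid_set))
```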


   We draw two funnels, one for each path. Because we are mainly interested in what users do after entering the book page, we can ignore the active-user layer for now.



Funnel 1 (three steps): the user enters the book page, starts a trial read, and then purchases after the trial.






import numpy as np
import matplotlib.pyplot as plt

active_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\active_uids.csv')
enterbook_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\enterbook_uids.csv')
trial_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\trial_uids.csv')
paid_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\paid_uids.csv')

num_active_uids = len(active_uids)
num_enterbook_uids = len(enterbook_uids)
num_trial_uids = len(trial_uids)
num_paid_uids = len(paid_uids)

trial_uids_set = set(trial_uids)
paid_uids_set = set(paid_uids)
paid_with_trial_uids = trial_uids_set & paid_uids_set

paid_without_trial_uids = paid_uids_set - paid_with_trial_uids
num_paid_without_trial_uids = len(paid_without_trial_uids)
num_paid_with_trial_uids = len(paid_with_trial_uids)

x = ['Enter Book', 'Trial', 'Trial Then Paid']
nums = [num_enterbook_uids, num_trial_uids, num_paid_with_trial_uids]
plt.bar(x, nums)
plt.show()


Funnel 2 (two steps): the user enters the book page and purchases directly.






import numpy as np
import matplotlib.pyplot as plt

active_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\active_uids.csv')
enterbook_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\enterbook_uids.csv')
trial_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\trial_uids.csv')
paid_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\paid_uids.csv')

num_active_uids = len(active_uids)
num_enterbook_uids = len(enterbook_uids)
num_trial_uids = len(trial_uids)
num_paid_uids = len(paid_uids)

trial_uids_set = set(trial_uids)
paid_uids_set = set(paid_uids)
paid_with_trial_uids = trial_uids_set & paid_uids_set

paid_without_trial_uids = paid_uids_set - paid_with_trial_uids
num_paid_without_trial_uids = len(paid_without_trial_uids)
num_paid_with_trial_uids = len(paid_with_trial_uids)

x = ['Enter Book', 'Paid Without Trial']
nums = [num_enterbook_uids, num_paid_without_trial_uids]
plt.bar(x, nums)
plt.show()


   Now draw both funnels in a single chart:






import numpy as np
import matplotlib.pyplot as plt

active_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\active_uids.csv')
enterbook_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\enterbook_uids.csv')
trial_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\trial_uids.csv')
paid_uids = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\paid_uids.csv')

num_active_uids = len(active_uids)
num_enterbook_uids = len(enterbook_uids)
num_trial_uids = len(trial_uids)
num_paid_uids = len(paid_uids)

trial_uids_set = set(trial_uids)
paid_uids_set = set(paid_uids)
paid_with_trial_uids = trial_uids_set & paid_uids_set

paid_without_trial_uids = paid_uids_set - paid_with_trial_uids
num_paid_without_trial_uids = len(paid_without_trial_uids)
num_paid_with_trial_uids = len(paid_with_trial_uids)

x = ['Enter Book', 'Trial Or Pay', 'Trial Then Pay']
nums_enter_book = [num_enterbook_uids, 0, 0]
plt.bar(x, nums_enter_book)

nums_trial = [0, num_trial_uids, num_paid_with_trial_uids]
plt.bar(x, nums_trial, label='Trial')

nums_without_trial = [0, num_paid_without_trial_uids, 0]
plt.bar(x, nums_without_trial, label='Pay Without Trial')

plt.legend()
plt.show()


   The chart shows that at the second layer of the funnel, the step right after entering the book page, most users (973) choose trial reading and a small number (114) buy directly. But only 22 of the trial readers went on to buy, so both the number of purchases and the conversion after trial reading are far lower than for users who bought directly from the book page. This leads to a bold question: is trial reading useful at all? If we removed it, would purchases go up?


   Suppose we simply removed the trial reading feature and then compared the purchase conversion rate before and after. The result might not be objective. The external environment keeps changing, so when the two measurements are taken on different dates it is hard to tell whether a difference comes from our change to the app or from the dates themselves: weekends, holidays, and Double Eleven all affect how users spend.
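The three counts quoted in the text (973, 22, and 114) are enough to put numbers on this gap. The sketch below only uses those counts; the direct-purchase conversion rate relative to all book-page visitors would also need the book-page count, which is not quoted in the text.

```python
# Counts quoted in the text.
num_trial = 973               # users who started a trial read
num_paid_with_trial = 22      # of those, users who then bought
num_paid_without_trial = 114  # users who bought without a trial read

trial_to_paid_rate = num_paid_with_trial / num_trial
direct_share_of_purchases = num_paid_without_trial / (
    num_paid_with_trial + num_paid_without_trial
)

print(f"Trial -> purchase conversion: {trial_to_paid_rate:.1%}")
print(f"Share of purchases made without a trial: {direct_share_of_purchases:.1%}")
```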


   Instead, we can use A/B testing: when we want to compare several versions of a product, we randomly divide the target users into groups of equal size during the same time period, give each group a different version, and then compare the results.
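Random assignment is the key step. A minimal sketch of one way to do it is shown below; the function name `split_ab`, the seed, and the user IDs are assumptions for illustration, not the platform's actual assignment logic.

```python
import random


def split_ab(user_ids, seed=0):
    """Randomly split user IDs into two groups whose sizes differ by at most one."""
    shuffled = list(user_ids)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical user IDs 1..1000.
group_a, group_b = split_ab(range(1, 1001))
print(len(group_a), len(group_b))
```

Because every user has the same chance of landing in either group, differences between the groups can be attributed to the product version rather than to who was selected.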






   After the A/B test, we pull the experimental data for groups A and B from the e-book platform's backend and load it into variables named for each group. Group A keeps trial reading; group B has no trial reading, so it has no trial_uids data.






import numpy as np

active_uids_A = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\active_uids_A.csv')
enterbook_uids_A = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\enterbook_uids_A.csv')
trial_uids_A = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\trial_uids_A.csv')
paid_uids_A = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\paid_uids_A.csv')

active_uids_B = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\active_uids_B.csv')
enterbook_uids_B = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\enterbook_uids_B.csv')
paid_uids_B = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\paid_uids_B.csv')

num_active_uids_A = len(active_uids_A)
num_enterbook_uids_A = len(enterbook_uids_A)
num_trial_uids_A = len(trial_uids_A)
num_paid_uids_A = len(paid_uids_A)

# Count group B from group B's own arrays
num_active_uids_B = len(active_uids_B)
num_enterbook_uids_B = len(enterbook_uids_B)
num_paid_uids_B = len(paid_uids_B)

print("Group A")
print("-----------")
print("Active users: %d" % num_active_uids_A)
print("Users who opened a book page: %d" % num_enterbook_uids_A)
print("Trial-reading users: %d" % num_trial_uids_A)
print("Purchasing users: %d" % num_paid_uids_A)
print("    ")
print("Group B")
print("-----------")
print("Active users: %d" % num_active_uids_B)
print("Users who opened a book page: %d" % num_enterbook_uids_B)
print("Purchasing users: %d" % num_paid_uids_B)


   With the key numbers for groups A and B in hand, we use a bar chart again to compare the purchase funnels of the two groups:






import numpy as np
import matplotlib.pyplot as plt

active_uids_A = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\active_uids_A.csv')
enterbook_uids_A = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\enterbook_uids_A.csv')
trial_uids_A = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\trial_uids_A.csv')
paid_uids_A = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\paid_uids_A.csv')

active_uids_B = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\active_uids_B.csv')
enterbook_uids_B = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\enterbook_uids_B.csv')
paid_uids_B = np.genfromtxt(r'C:\Users\Administrator\Desktop\funnel_analyst\AB\paid_uids_B.csv')

num_active_uids_A = len(active_uids_A)
num_enterbook_uids_A = len(enterbook_uids_A)
num_trial_uids_A = len(trial_uids_A)
num_paid_uids_A = len(paid_uids_A)

num_active_uids_B = len(active_uids_B)
num_enterbook_uids_B = len(enterbook_uids_B)
num_trial_uids_B = 0
num_paid_uids_B = len(paid_uids_B)

labels = ['Enter Book', 'Trial', 'Pay']
data_A = [num_enterbook_uids_A, num_trial_uids_A, num_paid_uids_A]
data_B = [num_enterbook_uids_B, num_trial_uids_B, num_paid_uids_B]

x = np.arange(len(labels))
width = 0.35
plt.bar(x - width/2, data_A, width, label='A')
plt.bar(x + width/2, data_B, width, label='B')

plt.xticks(x, labels)
plt.legend()
plt.show()


   At the layer of entering the book page, groups A and B are similar. That is expected, because the test only changes the trial button inside the book page. It cannot affect a user's decision to enter the book page from the home page, since at that point the user cannot yet see what the book page looks like. We can also see that group B's purchase count is clearly higher than group A's. Can we conclude that sales improve once trial reading is removed? We kept the A/B test running for a week and found that each day's results matched the first day, which supports the reliability of the test.
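Beyond repeating the test over several days, one common way to check whether a gap between two conversion rates could be due to chance is a two-proportion z-test. This is not part of the original analysis; the sketch below uses only the standard library, and the counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical counts: purchasers and book-page visitors in each group.
z, p = two_proportion_z_test(conv_a=60, n_a=1000, conv_b=110, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be random noise, though it says nothing about why the difference exists.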


   The data indicates that users are more willing to buy once trial reading is removed, but what is the reason behind this? Other online reading apps and platforms all offer trial reading; are they all wrong? To understand why, we need to analyze the behavior data of the users who did try reading.


After the user starts trial reading and enters the book reading page for the first time, there are three possible behaviors as follows:

Reading
Next Page
Exit Reading


   Next, let's study how users choose among these behaviors. Next Page and Exit Reading each correspond to a specific button, so the program can detect whether the user clicked them. But how do we know whether the user is actually reading? We cannot observe reading directly, but we do know how long the user stays on the reading screen. With very few exceptions, the longer a user stays, the more likely they are reading. We therefore pulled this e-book's trial reading data for the past week for a closer look:






import pandas as pd

prince_trial = r'C:\Users\Administrator\Desktop\funnel_analyst\AB\ice_trial.csv'
df = pd.read_csv(prince_trial)
print(df.head())

This prints the first 5 rows. The columns are:


user_id: user ID
time_stay: time spent on the first page, in seconds
next_page: whether the user clicked Next Page
quit_trial: whether the user exited reading
finish_trial: whether the user finished the trial
pay: whether the user purchased


   Taking the first row as an example, it means that the user with ID 125777 tried reading, stayed on the first page for 26 seconds, clicked the next page, did not choose to exit reading, did not complete the trial reading, and finally did not purchase.


   Analyze the dwell time distribution of users:






import pandas as pd
import matplotlib.pyplot as plt

prince_trial = r'C:\Users\Administrator\Desktop\funnel_analyst\AB\ice_trial.csv'
df = pd.read_csv(prince_trial)

plt.hist(df['time_stay'], bins=10, facecolor="blue", edgecolor="black", alpha=0.7)
plt.xlabel("Time Staying")
plt.ylabel("Number of Users")
plt.show()


   Most trial users stay on the first page for no more than 30 seconds. In 30 seconds or less they clearly cannot finish reading the page, so they are probably not very likely to click Next Page. To check this conjecture, we split the users into two groups, those who stayed 30 seconds or less and those who stayed more than 30 seconds, and compare the share of each group that clicked Next Page.






import pandas as pd
import matplotlib.pyplot as plt
prince_trial = r'C:\Users\Administrator\Desktop\funnel_analyst\AB\ice_trial.csv'

df = pd.read_csv(prince_trial)

# Select the records whose time_stay is less than or equal to 30
under_30 = df[df['time_stay'] <= 30]

# Select the records whose time_stay is greater than 30
above_30 = df[df['time_stay'] > 30]

# Count the users in each group
num_under_30 = len(under_30)
num_above_30 = len(above_30)

under_30_and_next_page = under_30[under_30['next_page']]
above_30_and_next_page = above_30[above_30['next_page']]

num_under_30_and_next_page = len(under_30_and_next_page)
num_above_30_and_next_page = len(above_30_and_next_page)

ratio_under_30_and_next_page = num_under_30_and_next_page / num_under_30
ratio_above_30_and_next_page = num_above_30_and_next_page / num_above_30

print("Reading time of 30 seconds or less: %d users in total, %d clicked Next Page, a ratio of %.0f%%" %
  (num_under_30, num_under_30_and_next_page, 100 * ratio_under_30_and_next_page))
print("Reading time of more than 30 seconds: %d users in total, %d clicked Next Page, a ratio of %.0f%%" %
  (num_above_30, num_above_30_and_next_page, 100 * ratio_above_30_and_next_page))


This gives us the following ratio results:


Reading time of 30 seconds or less: 1134 users in total, 9 clicked Next Page, a ratio of 1%
Reading time of more than 30 seconds: 188 users in total, 64 clicked Next Page, a ratio of 34%
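The same comparison can also be written more compactly with a pandas groupby. This is an alternative sketch, not the code used above; it runs on a tiny made-up sample that reuses the `time_stay` and `next_page` column names and assumes `next_page` holds boolean values.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the trial data.
df = pd.DataFrame({
    "time_stay": [5, 12, 30, 31, 45, 90],
    "next_page": [False, False, True, False, True, True],
})

# Label each row by whether the dwell time exceeds 30 seconds.
group = df["time_stay"].gt(30).map({False: "<= 30s", True: "> 30s"})

# For a boolean column, the mean is the share of True values.
summary = df.groupby(group)["next_page"].agg(users="count", clicked="sum", ratio="mean")
print(summary)
```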


   We have found a correlation between reading time and whether the user clicks Next Page, and perhaps even whether they purchase. So can we say that if we find a way to make users read longer, purchases will eventually go up? No: correlation does not imply causation. For example, a supermarket may find that ice cream sales and air conditioner sales are correlated, but that does not mean it can sell more air conditioners by selling more ice cream. Hot summer weather is what drives both up at the same time.


   In our product, there is probably a factor that makes users both read briefly and be unwilling to buy. What is it? One thought: the main reason a user cannot get through a Chinese book is that the content is too complicated, whereas the reason a user cannot get through an English book is probably insufficient English. So did many of these users give up because they wanted to read the English edition, started reading, found their English was not good enough, and quit? To verify this, we can recruit enough users for surveys and interviews, or ask users directly inside the product. We chose the latter: when the user taps Back, a pop-up asks why they are leaving: not interested in the content, too difficult to understand, both, or neither.


   After collecting data for several days, we obtained the following feedback:


Not interested: 152
Can't understand English: 888
Both: 122
Other reasons: 130
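To read these counts as proportions, a short calculation helps. The sketch below uses only the four counts listed above.

```python
# Feedback counts listed above.
feedback = {
    "Not interested": 152,
    "Can't understand English": 888,
    "Both": 122,
    "Other reasons": 130,
}

total = sum(feedback.values())
for reason, count in feedback.items():
    print(f"{reason}: {count / total:.1%}")

# Everyone who cited difficulty with English, alone or together with lack of interest.
cited_english = feedback["Can't understand English"] + feedback["Both"]
print(f"Cited English difficulty: {cited_english / total:.1%}")
```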


   Visually display feedback data by drawing a pie chart:






import matplotlib.pyplot as plt

labels = ['Not Interested', 'Don\'t Understand', 'Both', 'Neither']
data = [152, 888, 122, 130]

colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0, 0.1, 0, 0)
plt.pie(data, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.show()


   Now we can confirm that in the earlier A/B test, most group A users who chose trial reading gave up quickly because they found during the trial that they could not understand the book. Group B users had no chance to try, so many who would not have understood the book did not know that and bought it first. Chinese book reading apps can afford to offer trial reading generously because users are unlikely to abandon a purchase after the trial for lack of comprehension.


   We have now gone through a complete cycle of discovering a problem, proposing hypotheses, testing and analyzing, and proposing solutions. Data analysis plays an indispensable role in this cycle, but not the only one: discovering problems and proposing hypotheses takes insight, and proposing solutions takes creativity, and neither can be acquired just by looking at numbers. Data analysis is a means of helping us find problems. Data can be the starting point of an analysis but not its end point. We must understand our users deeply, including their usage scenarios and difficulties, in order to keep iterating on the product and keep improving.


Origin blog.csdn.net/weixin_48591974/article/details/128257614