Mining and Predicting Smart Device User Behavior

This paper mines three types of user behavior: application usage, smart device usage, and the periodicity of user behavior. For application usage, application installation, the most frequently used applications, and application correlation are analyzed; application usage is found to be long-tailed. For device usage, the mean, variance, and autocorrelation are calculated for both duration and interval. Both duration and interval are long-tailed, but only duration follows a power-law distribution. Moreover, the autocorrelation of both duration and interval is weak, which undermines the approach, taken in related works, of predicting user behavior from adjacent behavior. The DFT (Discrete Fourier Transform) is then used to analyze the periodicity of user behavior, and the results show that the most obvious periodicity is 24 hours, in agreement with related works. Based on these results, an improved user behavior prediction model based on the Chebyshev inequality is proposed. Experimental results show that the model performs well in terms of accuracy rate and recall rate.


Introduction
Mobile services and applications have developed explosively in recent years. All personalized services are based on an understanding of user behavior. The smart device is the most personal piece of equipment a user owns, so mining smart device usage behavior is the most important area of user behavior mining and can contribute much to personalized services.
There has been much research on mining mobile user behavior. [1] predicted users' moods by mining the usage of smart devices. [2] focused on payment behavior on smart devices. [3] studied how applications are used to save energy. [4][5] recommended applications by analyzing application usage behavior on smart phones. [6] classified applications with natural language processing methods using data from the app store. [7] studied the relationship between application usage and geographical position. [8][9] collected information such as position, time, and sensor data and predicted user behavior. [10] classified users by their behavior.
Although there is much research on mining mobile user behavior, its focuses vary. Research that jointly mines application usage, smart device usage, and the time features of user behavior is still lacking. This paper collects sufficient data and mines user behavior in these three aspects.

Data Collection
The following three types of data are collected in this paper: application usage, smart device usage, and application installation.
For application usage, the data format is (user_i, time_j, app_k), which means that user i uses app k at time j. In Android, this can be obtained with the method getRunningTasks() of class ActivityManager. Traditional telecommunication applications, such as call and SMS, are filtered out in this paper.
For smart device usage, the data format is (user_i, time_j1, time_j2), which means that user i begins to use the smart device at time j1 and stops using it at time j2. The start and end of smart device use are reflected in the on/off state of the device's screen. In Android, this can be obtained by registering a BroadcastReceiver that receives the events ACTION_SCREEN_ON and ACTION_SCREEN_OFF.
Application installation can be obtained with the method getInstalledPackages() of class PackageManager in Android. Built-in applications, such as phone, SMS, and settings, are not collected in this paper.
The data collection code is integrated into a specific version of the application At Tsinghua [11][12]. Users are notified of the data collection by an announcement and can choose to decline it.
From December 4, 2013 to April 4, 2014, 2690 users accepted the data collection. Users are identified by the MAC address of the smart device.
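As an illustration of how the screen events could be turned into usage records, the following minimal sketch pairs ON/OFF events into (start, end) periods for one user and derives durations and intervals. It assumes that events arrive as time-sorted (timestamp in seconds, action) tuples and that an interval is the screen-off gap between two consecutive uses; the function names and the handling of unmatched events are hypothetical, not taken from the collection code.

```python
SCREEN_ON = "ACTION_SCREEN_ON"
SCREEN_OFF = "ACTION_SCREEN_OFF"


def build_sessions(events):
    """Pair screen events into (start, end) usage periods.

    `events` is a time-sorted list of (timestamp_seconds, action) for one user.
    Assumption: a repeated ON keeps the earlier start, and an OFF without a
    preceding ON is ignored.
    """
    sessions = []
    start = None
    for timestamp, action in events:
        if action == SCREEN_ON:
            if start is None:
                start = timestamp
        elif action == SCREEN_OFF and start is not None:
            sessions.append((start, timestamp))
            start = None
    return sessions


def durations_and_intervals(sessions):
    """Durations are end - start; intervals are the gaps between sessions."""
    durations = [end - start for start, end in sessions]
    intervals = [nxt[0] - cur[1] for cur, nxt in zip(sessions, sessions[1:])]
    return durations, intervals


events = [
    (0, SCREEN_ON), (60, SCREEN_OFF),
    (300, SCREEN_ON), (320, SCREEN_ON), (400, SCREEN_OFF),
    (500, SCREEN_OFF),
]
sessions = build_sessions(events)
durations, intervals = durations_and_intervals(sessions)
```

In this toy input the duplicated ON at 320 s and the stray OFF at 500 s are dropped, leaving two usage periods.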

Application Usage Statistics
There are 14,293 different applications installed by the 2690 users. Across all users, the maximum number of installed applications is 226 and the minimum is 1. The average is 39.85 and the standard deviation is 26.88. The most popular applications are shown in Table 1.
Of all the installed applications, some are frequently used and some are rarely used. The usage frequency is clearly long-tailed, as shown in Table 2. Notably, nearly 70 percent of the applications were never used during the four months. The most frequently used applications are shown in Table 3.
Applications are not independent of each other. Within a specific period of device usage, users usually switch from one application to another. A period here means a span during which the user is using the device continuously, that is, during which the screen is never off. The switching behavior reflects the correlation of applications. To describe it, an n×n matrix C is introduced, where n is the number of all applications. In a period of device usage, if the user switches from app i to app j, then c_ij is incremented by one. The correlation of applications is shown in Table 4.
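The switch counting can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each period is given as the ordered list of applications used while the screen stayed on, and it stores C sparsely as a dictionary keyed by (source, destination), since a dense matrix over 14,293 applications would have roughly 2×10^8 cells. The application names are placeholders.

```python
from collections import defaultdict


def count_switches(periods):
    """Count switches between apps within each screen-on period.

    `periods` is a list of app-name lists, each in order of use.
    Returns a sparse version of C: {(app_i, app_j): count}.
    """
    c = defaultdict(int)
    for apps in periods:
        for src, dst in zip(apps, apps[1:]):
            if src != dst:  # consecutive records of the same app are not a switch
                c[(src, dst)] += 1
    return c


def top_pairs(c, k):
    """Return the k most frequent (pair, count) entries, ties broken by name."""
    return sorted(c.items(), key=lambda item: (-item[1], item[0]))[:k]


periods = [
    ["app_a", "app_b", "app_a"],
    ["app_a", "app_b"],
    ["app_c"],
]
c = count_switches(periods)
best = top_pairs(c, 1)
```

Switches are counted only inside a period, so the last application of one period and the first of the next never form a pair.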

Device Usage Statistics
First, the total time of smart device usage in one day is calculated. Over the span of four months and across all collected users, the longest device usage in one day is 324.2 minutes and the shortest is 2 minutes, with days of non-use excluded. The average usage is 53.0 minutes and the standard deviation is 42.4 minutes. So there is no giant gap between the most active users and the least active users.
Next, the duration of each single device usage is analyzed. Here duration has the same meaning as period in Section 2.2: a span during which the screen is never off. Over all durations, the average is 60.9 seconds, the standard deviation is 241.5 seconds, the maximum is 299.3 minutes, the minimum is 0.7 seconds, and the coefficient of variation is 396.6%. Durations also vary for a single user. The CDF (cumulative distribution function) of all these durations is shown in Figure 1(a). The function is nearly linear in log-log coordinates, as shown in Figure 1(b). The R-square test gives a correlation coefficient of 0.9373, so it is concluded that the duration of smart device usage obeys a power-law distribution. The autocorrelation is then analyzed. Autocorrelation analysis is commonly used to reflect the degree of correlation between values of the same sequence at different times. The first twenty points of the autocorrelation are calculated and shown in Figure 1(c).
From Figure 2(a), the interval is also long-tailed, but Figure 2(b) shows that the interval does not obey a power-law distribution: the correlation coefficient is only 0.6934. [13] reviewed many studies of human behavior and pointed out that many human behaviors, such as calling, sending short messages, and sending emails, obey power-law distributions. Here we find an exception.
From Figure 2(c), the autocorrelation between adjacent device usages is weak. So it is hard to predict how long it will be before the user picks up the device again from the last few intervals alone. That is to say, the weak autocorrelation of both duration and interval undermines the approach, taken in related works, of predicting user behavior from adjacent behavior.
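The linearity check in log-log coordinates can be illustrated with an ordinary least-squares fit on log10-transformed points. The sketch below is an assumption about how such a test might be run, not the procedure used in this paper: in particular, it builds a complementary CDF (fraction of samples at or above each value), which is a common choice for power-law checks, and the function names are hypothetical.

```python
import math


def empirical_ccdf(samples):
    """Return (value, fraction of samples >= value) for each distinct value."""
    ordered = sorted(samples)
    n = len(ordered)
    points = []
    for i, value in enumerate(ordered):
        if i == 0 or value != ordered[i - 1]:
            points.append((value, (n - i) / n))
    return points


def loglog_fit(xs, ys):
    """Least-squares line through (log10 x, log10 y); returns slope, intercept, R^2."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# An exact power law y = 1/x is a straight line in log-log coordinates.
slope, intercept, r_squared = loglog_fit([1, 10, 100, 1000], [1, 0.1, 0.01, 0.001])
ccdf_points = empirical_ccdf([1, 1, 2, 4])
```

An R^2 close to 1 indicates a nearly straight line in log-log coordinates, while a lower value suggests that a power law is a poor description of the data.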
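For readers who want to reproduce this kind of analysis, a minimal sketch of the sample autocorrelation is given below. It assumes the usual normalized estimator (lagged covariance divided by the lag-0 sum of squares) and returns lags 1 to max_lag; whether the figures use exactly this estimator is not stated, so this is an illustration only.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation of `series` for lags 1..max_lag.

    Values near 0 mean that adjacent observations carry little information
    about each other; values near +1 or -1 indicate strong dependence.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [value - mean for value in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denom
        for lag in range(1, max_lag + 1)
    ]


# A strictly alternating series is strongly (negatively) correlated at lag 1.
r = autocorrelation([1, -1, 1, -1, 1, -1], max_lag=2)
```

Applied to a sequence of durations or intervals with max_lag=20, this would produce the kind of twenty-point curve discussed above.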

Periodicity of User Behavior
The DFT (Discrete Fourier Transform) is often used to analyze periodic behavior. [14] used the DFT to understand user behavior. Here the DFT is performed again to pave the way for the prediction of user behavior in Section 2.5.
First, the concept of active degree is introduced to quantify how actively a user uses the smart device. For every minute, if the user is using the smart device, the active degree of that minute is 1, otherwise 0. For every ten minutes, the active degree is the sum over its minutes. Ten minutes is the minimum time unit in the following steps. The DFT is then performed, and the PSD (power spectral density) is shown in Figure 3. In Figure 3, one unit of the abscissa is 2π/(NT) = 6.69×10^-7 Hz, where N = 15645 and T = 10 min = 600 s. The PSD is long-tailed, and only the first 500 values are presented. In Figure 3, the power spectral density reaches its peak when the abscissa equals 17, with the corresponding periodicity equal to NT/(2π × 3600 × 17) = 24.4 h. So it is concluded that the most obvious periodicity of user behavior is 24 hours, which is in agreement with [14].
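The two steps above, binning per-minute activity into ten-minute active degrees and locating the strongest spectral peak, can be sketched as follows. This is a simplified illustration on synthetic data using a naive O(N^2) DFT from the standard library. It uses the textbook convention in which bin k corresponds to a period of NT/k, so its bin indices are not directly comparable with the abscissa of Figure 3; all function names are hypothetical.

```python
import cmath
import math


def active_degree(minute_flags, unit=10):
    """Sum per-minute 0/1 flags into consecutive bins of `unit` minutes."""
    last_start = len(minute_flags) - unit
    return [sum(minute_flags[i:i + unit]) for i in range(0, last_start + 1, unit)]


def power_spectrum(x):
    """Power of DFT bins 1..N//2 after removing the mean (naive O(N^2) DFT)."""
    n = len(x)
    mean = sum(x) / n
    centered = [value - mean for value in x]
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            centered[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        power.append(abs(coeff) ** 2 / n)
    return power


def dominant_period_hours(x, unit_seconds=600):
    """Period, in hours, of the bin with the largest power."""
    power = power_spectrum(x)
    k = max(range(len(power)), key=power.__getitem__) + 1  # bins start at 1
    return len(x) * unit_seconds / k / 3600


# Synthetic check: three days of ten-minute bins with a daily rhythm.
bins_per_day = 144
signal = [5 + 5 * math.cos(2 * math.pi * t / bins_per_day) for t in range(3 * bins_per_day)]
period = dominant_period_hours(signal)
degrees = active_degree([1] * 3 + [0] * 7 + [1] * 10)
```

For a series as long as the one analyzed here, an FFT implementation would be preferable to the naive transform; the naive form is used only to keep the sketch dependency-free.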

Predicting User Behavior
There is not much research on predicting user behavior. Of all the existing relevant works, [14] is the most classic. [14] first analyzed the periodicity of user behavior using the DFT, and then used the Chebyshev inequality to predict the top k applications that the user is likely to use at a specific time, putting these applications on the home screen so that users can launch their target applications quickly.
Despite its solid theoretical basis, one point in [14] is still open to discussion. User behavior is periodic and the most obvious periodicity is 24 hours, as shown in both [14] and this paper. But when making predictions, [14] limits its focus to a single day and predicts the behavior at a specific time x using historical behavior at other times. For example, when [14] predicted the behavior at 15:00, it used the historical behavior at 9:23 and 22:08.
Figure 4. The predicting approach in [14].
This approach is unreasonable in two respects. First, the most obvious periodicity is 24 hours, yet [14] predicted the behavior using historical behavior from other times of day. Second, as shown in Figure 1(c) and Figure 2(c), the autocorrelation of neither duration nor interval is significant, so predicting user behavior from adjacent behavior is not a good choice.
Here we build a new model to predict the user active degree. First, let us state the problem clearly. The aim is to predict the user behavior on the n-th day using behavior data from day 1 to day n-1.
Let a_{i,x} represent the user active degree at time x on day i. The smallest time unit is ten minutes. Smooth a_{i,x} over ten minutes to obtain a'_{i,x}. Calculate the mean and variance of a'_{i,x} (1 ≤ i ≤ n-1), denoted E and D respectively. Let A denote the real user active degree that is to be predicted. According to the Chebyshev inequality, expression (1) gives the relationship between the value of A and its probability, where P(x) is the probability of event x and ε stands for any positive value.
There are two ways to read expression (1). First, given an acceptable threshold of error probability P_th, we can obtain the minimum ε, denoted ε_min, that satisfies the inequality D/ε² ≤ P_th. This ε_min is also the largest deviation between the real active degree A and the mean of the historical active degree E; in other words, we can say with confidence (1 - P_th) that the deviation between A and E is smaller than ε_min. Second, given an acceptable threshold of prediction deviation ε_th, we can obtain the largest error probability P_max = D/ε_th²; in other words, the probability that the deviation between A and E exceeds ε_th is smaller than P_max. The calculation of the mean and variance of the first n days of behavior is shown in expression (2) and expression (3).
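Both readings can be made concrete with a short sketch. It assumes the standard form of the Chebyshev inequality, P(|A - E| ≥ ε) ≤ D/ε², and treats the sample mean and population variance of the history as E and D; the smoothing step is omitted and the function names are hypothetical.

```python
import math
from statistics import fmean, pvariance


def chebyshev_interval(history, p_th):
    """First reading: given P_th, return (E - eps_min, E + eps_min).

    eps_min is the smallest eps with D / eps**2 <= P_th, i.e. sqrt(D / P_th).
    """
    if not 0 < p_th <= 1:
        raise ValueError("p_th must lie in (0, 1]")
    e = fmean(history)
    d = pvariance(history)
    eps_min = math.sqrt(d / p_th)
    return e - eps_min, e + eps_min


def max_error_probability(history, eps_th):
    """Second reading: given eps_th, return P_max = D / eps_th**2 (capped at 1)."""
    if eps_th <= 0:
        raise ValueError("eps_th must be positive")
    d = pvariance(history)
    return min(1.0, d / eps_th ** 2)


# Active degree observed at the same time slot on eight earlier days (toy data).
history = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
low, high = chebyshev_interval(history, p_th=0.25)
p_max = max_error_probability(history, eps_th=4.0)
```

With this toy history the mean is 5 and the variance is 4, so a threshold of P_th = 0.25 gives ε_min = 4 and a predicted interval of [1, 9].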
To obtain the mean and variance for the n-th day, the data of all of the first (n-1) days is used, so all of this data must be kept in storage and the space complexity is O(n). Besides, calculating the variance requires (n+2) multiplications, so the time complexity is O(n).
Iterative formulas (4) and (5) are then derived to reduce the complexity.
From (4) and (5), calculating the mean and variance for the (n+1)-th day requires only the mean and variance of the n-th day and the user behavior of the (n+1)-th day. That is, the iterative formulas do not need to keep the data of the first (n-1) days, which reduces the space complexity from O(n) to O(1). Meanwhile, calculating the variance for the (n+1)-th day takes only 5 multiplications, which reduces the time complexity from O(n) to O(1).
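The idea of a constant-space update can be sketched with a textbook recurrence for the running mean and population variance. This is an assumption for illustration and is not necessarily identical to formulas (4) and (5); in particular, its multiplication count may differ from the figure quoted above. The class name is hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean, pvariance


@dataclass
class RunningStats:
    """Running mean and population variance in O(1) space and time per update."""

    n: int = 0
    mean: float = 0.0
    var: float = 0.0

    def update(self, a):
        """Fold in one new observation without storing earlier ones."""
        delta = a - self.mean
        # Var_{n+1} = n/(n+1) * (Var_n + delta^2 / (n+1)), using the old mean.
        self.var = (self.n / (self.n + 1)) * (self.var + delta * delta / (self.n + 1))
        # Mean_{n+1} = Mean_n + delta / (n+1)
        self.mean += delta / (self.n + 1)
        self.n += 1


history = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = RunningStats()
for value in history:
    stats.update(value)
```

Only three numbers are kept per time slot, so the storage does not grow with the number of days observed.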

Evaluation index
Two indexes are defined here to evaluate the prediction model in Section 2.5. The first is the accuracy rate, defined as the probability that the real active degree falls in the expected interval. According to the Chebyshev inequality, the accuracy rate cannot be smaller than (1 - P_th). Although the upper bound on the error probability cannot be lowered in general, because there is a distribution that reaches it, as pointed out in [15], the upper bound given by the Chebyshev inequality is usually loose. So the accuracy rate still needs to be examined.
Generation Internet" grants 61161140454 and the EU FP7 under grant number PIRSES-GA-2013-610524.

Figure 1. (a) The CDF of durations. (b) The CDF in log-log coordinates. (c) Autocorrelation of durations.
Figure 2. (a) The CDF of intervals. (b) The CDF in log-log coordinates. (c) Autocorrelation of intervals.

Figure 3. The power spectral density of user behavior.

Table 1.
The

Table 2. The usage frequency of all applications

Table 3. Top 20 frequently used applications

Table 4. Top 20 correlations of application pairs