A New Method to Forecast Enrollments Using Fuzzy Time Series and Clustering Techniques
Kurniawan Tanuwijaya (1) and Shyi-Ming Chen (1, 2)
(1) Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C.
(2) Department of Computer Science and Information Engineering, Jinwen University of Science and Technology, Taipei County, Taiwan, R.O.C.
Outline
1. Introduction
2. A Review of Fuzzy Time Series
3. The Proposed Clustering Algorithm
4. A New Method for Forecasting Enrollments Based on the Fuzzy Time Series and the Proposed Clustering Algorithm
5. Experimental Results
6. Conclusions
1. Introduction
[1993] Song and Chissom proposed the concept of fuzzy time series and used two fuzzy time series models (i.e., time-variant and time-invariant fuzzy time series) to forecast the enrollments of the University of Alabama.
[1994] Sullivan and Woodall compared Song and Chissom’s methods with a time-invariant Markov model using linguistic labels.
[1996] Chen proposed a method that uses simple arithmetic operations.
[2001] Huarng presented a heuristic model that integrates Chen’s model.
[2002] Chen presented high-order fuzzy time series.
[2006] Hwang et al. presented time-variant fuzzy time series.
In this paper, we present a new method to forecast the enrollments of the University of Alabama based on fuzzy time series and clustering techniques.
2. A Review of Fuzzy Time Series
A fuzzy set A of the universe of discourse U, U = {u_1, u_2, …, u_n}, is defined as follows:
$$A = f_A(u_1)/u_1 + f_A(u_2)/u_2 + \cdots + f_A(u_n)/u_n,$$
where f_A is the membership function of the fuzzy set A, f_A(u_i) denotes the grade of membership of u_i in the fuzzy set A, and 1 ≤ i ≤ n.
Let Y(t) (t = …, 0, 1, 2, …) be the universe of discourse on which the fuzzy sets f_i(t) (i = 1, 2, …) are defined, and let F(t) be the collection of f_i(t) (i = 1, 2, …). Then F(t) is called a fuzzy time series on Y(t) (t = …, 0, 1, 2, …).
Assume that there is a fuzzy relationship R(t, t-1) between F(t-1) and F(t) such that F(t) = F(t-1) ∘ R(t, t-1), where “∘” is the Max-Min composition operator. Then F(t) is said to be caused by F(t-1), and this is denoted by a fuzzy logical relationship, shown as follows:
$$F(t-1) \rightarrow F(t),$$
where both F(t-1) and F(t) are fuzzy sets, and “F(t-1)” and “F(t)” are called the current state and the next state, respectively.
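The Max-Min composition used in F(t) = F(t-1) ∘ R(t, t-1) can be illustrated with a minimal sketch. The function name `max_min_compose` and all membership grades below are hypothetical values chosen for illustration; they do not come from the slides.

```python
def max_min_compose(f_prev, relation):
    """Max-Min composition: out[j] = max over i of min(f_prev[i], relation[i][j])."""
    n_cols = len(relation[0])
    return [
        max(min(f_prev[i], relation[i][j]) for i in range(len(f_prev)))
        for j in range(n_cols)
    ]


# Hypothetical membership grades of F(t-1) over three intervals.
f_prev = [0.5, 1.0, 0.5]

# Hypothetical fuzzy relation R(t, t-1): rows index F(t-1), columns index F(t).
relation = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]

f_next = max_min_compose(f_prev, relation)
print(f_next)
```

Each output grade takes the strongest path through the relation, where the strength of a path is limited by its weakest link.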
Let F(t) be a fuzzy time series. If F(t) is caused by F(t-1), F(t-2), …, and F(t-n), then the fuzzy logical relationship between them can be represented by a high-order fuzzy logical relationship, shown as follows:
$$F(t-n), \ldots, F(t-2), F(t-1) \rightarrow F(t),$$
where F(t-n), …, F(t-2), and F(t-1) are fuzzy sets, and “F(t-n), …, F(t-2), F(t-1)” and “F(t)” are called the current state and the next state of the high-order fuzzy logical relationship, respectively. The fuzzy logical relationships having the same current state are grouped into a fuzzy logical relationship group.
3. The Proposed Clustering Algorithm
The proposed clustering algorithm is used to partition the universe of discourse into intervals of different lengths.
Step 1: Sort the numerical data in ascending order, shown as follows:
$$d_1, d_2, \ldots, d_n.$$
Calculate the threshold value for the stopping condition of the proposed clustering algorithm, shown as follows: (1)
Step 2: Put each datum into its own cluster, shown as follows:
$$\{d_1\}, \{d_2\}, \ldots, \{d_n\},$$
where the symbol “{ }” denotes a cluster.
Step 3: Assume that there are p clusters. Calculate the cluster center cluster_center_k of each cluster Cluster_k as follows:
$$\mathit{cluster\_center}_k = \frac{\sum_{j=1}^{r} d_j}{r}, \qquad (2)$$
where d_j is a datum in Cluster_k, r is the number of data in Cluster_k, and 1 ≤ k ≤ p.
Calculate the distance distance_{m,m+1} between each pair of adjacent cluster centers cluster_center_m and cluster_center_{m+1}, shown as follows:
$$\mathit{distance}_{m,m+1} = \mathit{cluster\_center}_{m+1} - \mathit{cluster\_center}_m, \qquad (3)$$
where m = 1, 2, …, p-1.
Step 4: Find the smallest distance smallest_distance:
$$\mathit{smallest\_distance} = \min_{1 \le m \le p-1} \mathit{distance}_{m,m+1}. \qquad (4)$$
Step 5: If smallest_distance is less than the threshold value, then combine the two clusters having the smallest distance between them into one cluster and go to Step 3. Otherwise, go to Step 6.
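Steps 1 to 5 can be sketched as a short merge loop. This is a sketch under one explicit assumption: the slide text does not preserve Eq. (1), so the threshold is taken here as the average gap between adjacent sorted data, a reading that reproduces the value 299 quoted in Section 5. The function name `cluster_sorted_data` is hypothetical.

```python
def cluster_sorted_data(data):
    """Merge adjacent clusters until the smallest centre gap reaches the threshold."""
    values = sorted(data)                                     # Step 1: ascending order
    if len(values) < 2:
        return [values], 0.0
    # ASSUMPTION: Eq. (1) is read as the average gap between adjacent sorted data.
    threshold = (values[-1] - values[0]) / (len(values) - 1)
    clusters = [[v] for v in values]                          # Step 2: one datum per cluster
    while len(clusters) > 1:
        centers = [sum(c) / len(c) for c in clusters]         # Step 3, Eq. (2)
        gaps = [b - a for a, b in zip(centers, centers[1:])]  # Step 3, Eq. (3)
        smallest = min(gaps)                                  # Step 4, Eq. (4)
        if smallest >= threshold:                             # Step 5: stopping condition
            break
        m = gaps.index(smallest)
        clusters[m:m + 2] = [clusters[m] + clusters[m + 1]]   # merge the closest pair
    return clusters, threshold


# Sorted enrollment data as listed in Section 5.
enrollments = [
    13055, 13563, 13867, 14696, 15145, 15163, 15311, 15433, 15460, 15497, 15603,
    15861, 15984, 16388, 16807, 16859, 16919, 18150, 18876, 18970, 19328, 19337,
]
clusters, threshold = cluster_sorted_data(enrollments)
print(round(threshold, 2))
for cluster in clusters:
    print(cluster)
```

Because only adjacent clusters are merged, the cluster centers stay in ascending order, so a single pass over neighbouring pairs is enough in each iteration.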
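Step 6: Calculate the upper bound cluster_uBound_m of Cluster_m and the lower bound cluster_lBound_{m+1} of Cluster_{m+1}:
$$\mathit{cluster\_uBound}_m = \frac{\mathit{cluster\_center}_m + \mathit{cluster\_center}_{m+1}}{2}, \qquad (5)$$
$$\mathit{cluster\_lBound}_{m+1} = \mathit{cluster\_uBound}_m, \qquad (6)$$
where m = 1, 2, …, p-1. Because there is no previous cluster before the first cluster and no next cluster after the last cluster, the lower bound cluster_lBound_1 of the first cluster and the upper bound cluster_uBound_p of the last cluster are calculated as follows:
$$\mathit{cluster\_lBound}_1 = \mathit{cluster\_center}_1 - (\mathit{cluster\_uBound}_1 - \mathit{cluster\_center}_1), \qquad (7)$$
$$\mathit{cluster\_uBound}_p = \mathit{cluster\_center}_p + (\mathit{cluster\_center}_p - \mathit{cluster\_lBound}_p). \qquad (8)$$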
Step 7: Let each cluster Cluster_k form an interval interval_k, which means that the upper bound cluster_uBound_k and the lower bound cluster_lBound_k of Cluster_k are also the upper bound interval_uBound_k and the lower bound interval_lBound_k of interval_k, respectively. Calculate the middle value mid_value_k of interval_k as follows:
$$\mathit{mid\_value}_k = \frac{\mathit{interval\_lBound}_k + \mathit{interval\_uBound}_k}{2}, \qquad (9)$$
where 1 ≤ k ≤ p.
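Steps 6 and 7 can be sketched as follows. The sketch assumes that adjacent clusters share a bound halfway between their centers and that the two outer bounds mirror the nearest inner bound about the cluster center; these readings agree with the interval bounds listed in Section 5 for the first cluster, but the equations themselves are not preserved in the slide text. The names `clusters_to_intervals` and `round_half_up` are hypothetical.

```python
import math


def clusters_to_intervals(clusters):
    """Return a (lower, upper, middle) triple for each cluster, in order."""
    if len(clusters) < 2:
        raise ValueError("at least two clusters are needed")
    centers = [sum(c) / len(c) for c in clusters]
    # ASSUMED Eqs. (5)-(6): adjacent clusters share a bound halfway between centres.
    inner = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    # ASSUMED Eqs. (7)-(8): outer bounds mirror the nearest inner bound.
    first_lower = centers[0] - (inner[0] - centers[0])
    last_upper = centers[-1] + (centers[-1] - inner[-1])
    bounds = [first_lower] + inner + [last_upper]
    # Eq. (9): the middle value of each interval.
    return [(lo, hi, (lo + hi) / 2) for lo, hi in zip(bounds, bounds[1:])]


def round_half_up(x):
    """Round to the nearest integer, with halves rounded upwards."""
    return math.floor(x + 0.5)


# Final clusters reported in Section 5.
final_clusters = [
    [13055], [13563], [13867], [14696],
    [15145, 15163, 15311, 15433, 15460, 15497, 15603],
    [15861, 15984], [16388], [16807, 16859, 16919],
    [18150], [18876, 18970], [19328, 19337],
]
intervals = clusters_to_intervals(final_clusters)
for lower, upper, middle in intervals:
    print(round_half_up(lower), round_half_up(upper), round(middle, 2))
```

Rounding halves upwards is a deliberate choice here, since Python's built-in `round` rounds halves to the nearest even integer.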
4. A New Method for Forecasting Enrollments Based on the Fuzzy Time Series and the Proposed Clustering Algorithm
Step 1: Apply the proposed clustering algorithm to partition the universe of discourse into intervals.
Step 2: Assume that n intervals u_1, u_2, …, u_n are obtained in Step 1. Define the linguistic terms A_1, A_2, …, A_n, each represented by a fuzzy set, such that the maximum membership value of A_i occurs at interval u_i, where 1 ≤ i ≤ n.
Step 3: Fuzzify each historical datum into a fuzzy set. If the datum belongs to u_i, then it is fuzzified into A_i, where 1 ≤ i ≤ n.
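Step 3 amounts to looking up the interval that contains each datum. A minimal sketch, using the rounded interval lower bounds reported in Section 5; the function name `fuzzify` is hypothetical, and the sketch does not check the upper bound of the last interval.

```python
from bisect import bisect_right

# Lower bounds of u_1 ... u_11, taken from the intervals listed in Section 5.
LOWER_BOUNDS = [12801, 13309, 13715, 14282, 15035, 15648,
                16155, 16625, 17506, 18537, 19128]


def fuzzify(datum, lower_bounds=LOWER_BOUNDS):
    """Return the 1-based index i such that the datum falls in interval u_i."""
    i = bisect_right(lower_bounds, datum)
    if i == 0:
        raise ValueError("datum lies below the universe of discourse")
    return i


for value in (13055, 15603, 15861, 19337):
    print(value, "-> A_%d" % fuzzify(value))
```

`bisect_right` treats each interval as closed on the left and open on the right, which matches the [lower, upper) notation of the intervals.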
Step 4: Construct the fuzzy logical relationships based on the fuzzified data obtained in Step 3.
Note: If the first order fuzzy time series is used and the fuzzified values of time t-1 and t are A_j and A_k, respectively, then construct the fuzzy logical relationship “A_j → A_k”, where “A_j” and “A_k” are called the current state and the next state of the fuzzy logical relationship, respectively. If the nth order fuzzy time series is used and the fuzzified values of time t-n, …, t-2, t-1 and t are A_{j,n}, …, A_{j,2}, A_{j,1} and A_k, respectively, then construct the fuzzy logical relationship “A_{j,n}, …, A_{j,2}, A_{j,1} → A_k”, where “A_{j,n}, …, A_{j,2}, A_{j,1}” and “A_k” are called the current state and the next state of the nth order fuzzy logical relationship, respectively.
Then group the fuzzy logical relationships having the same current state into a fuzzy logical relationship group.
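Step 4 can be sketched as a sliding window over the fuzzified sequence, with one counter of next states per current state. The label sequence below is the one from Table 3 in Section 5, read here as ending A_10, A_11, A_11, A_10, which agrees with the groups in Tables 4 and 5. The function name `build_flr_groups` is hypothetical.

```python
from collections import Counter, defaultdict


def build_flr_groups(labels, order=1):
    """Group fuzzy logical relationships by current state.

    Keys are tuples of `order` consecutive labels (the current state);
    values count how often each next state follows that current state.
    """
    groups = defaultdict(Counter)
    for t in range(order, len(labels)):
        current_state = tuple(labels[t - order:t])
        groups[current_state][labels[t]] += 1
    return groups


# Fuzzified enrollments, written as the indices i of A_i.
labels = [1, 2, 3, 4, 5, 5, 5, 6, 8, 8, 7, 5, 5, 5, 5, 6, 8, 9, 10, 11, 11, 10]

first_order = build_flr_groups(labels, order=1)
second_order = build_flr_groups(labels, order=2)
print(dict(first_order[(5,)]))
print(dict(second_order[(6, 8)]))
```

Storing counts rather than plain sets keeps the multiplicities, such as A_5 (5) and A_6 (2), that the forecasting step uses as weights.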
Step 5: Calculate the forecasted output at time t by using the following principles:
Principle 1: If the fuzzified values at time t-n, …, t-2, and t-1 are A_{j,n}, …, A_{j,2}, and A_{j,1}, respectively, and there is only one fuzzy logical relationship in the fuzzy logical relationship group, shown as follows:
$$A_{j,n}, \ldots, A_{j,2}, A_{j,1} \rightarrow A_k,$$
then the forecasted value of time t is m_k, where m_k is the middle value of the interval u_k and the maximum membership value of A_k occurs at interval u_k.
Principle 2: If the fuzzified values at time t-n, …, t-2, and t-1 are A_{j,n}, …, A_{j,2}, and A_{j,1}, respectively, and there are several fuzzy logical relationships in the fuzzy logical relationship group, shown as follows:
$$A_{j,n}, \ldots, A_{j,2}, A_{j,1} \rightarrow A_{k1}\,(x_1),\ A_{k2}\,(x_2),\ \ldots,\ A_{kp}\,(x_p),$$
then the forecasted value of time t is calculated as follows:
$$\frac{x_1 \times m_{k1} + x_2 \times m_{k2} + \cdots + x_p \times m_{kp}}{x_1 + x_2 + \cdots + x_p},$$
where x_i denotes the number of fuzzy logical relationships “A_{j,n}, …, A_{j,2}, A_{j,1} → A_{ki}” in the fuzzy logical relationship group, 1 ≤ i ≤ p; m_{k1}, m_{k2}, …, and m_{kp} are the middle values of the intervals u_{k1}, u_{k2}, …, and u_{kp}, respectively; and the maximum membership values of A_{k1}, A_{k2}, …, and A_{kp} occur at intervals u_{k1}, u_{k2}, …, and u_{kp}, respectively.
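Principles 1 and 2 can be read as one weighted average, since a group with a single next state is the special case p = 1. A minimal sketch under that reading; the function name `forecast_from_group` and the numbers in the example are hypothetical.

```python
def forecast_from_group(next_state_counts, mid_values):
    """Weighted average of interval middle values.

    next_state_counts maps a next-state index k to its count x_k;
    mid_values maps an interval index k to its middle value m_k.
    """
    total = sum(next_state_counts.values())
    if total == 0:
        raise ValueError("the group has no next state")
    weighted = sum(x * mid_values[k] for k, x in next_state_counts.items())
    return weighted / total


# Hypothetical middle values for three intervals.
mid_values = {1: 10.0, 2: 20.0, 3: 40.0}

# Principle 1: a single relationship A_1 -> A_2 gives the middle value of u_2.
print(forecast_from_group({2: 1}, mid_values))

# Principle 2: the group A_2 -> A_2 (3), A_3 (1) gives (3*20 + 1*40) / 4.
print(forecast_from_group({2: 3, 3: 1}, mid_values))
```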
Principle 3: If the fuzzified values at time t-n, …, t-2, and t-1 are A_{j,n}, …, A_{j,2}, and A_{j,1}, respectively, and there is only one fuzzy logical relationship in the fuzzy logical relationship groups, shown as follows: then the forecasted value of time t is calculated as follows: where m_{j,n}, …, m_{j,2} and m_{j,1} are the middle values of the intervals u_{j,n}, …, u_{j,2} and u_{j,1}, respectively, and the maximum membership values of A_{j,n}, …, A_{j,2} and A_{j,1} occur at intervals u_{j,n}, …, u_{j,2} and u_{j,1}, respectively.
5. Experimental Results
A. The Proposed Method Using the First Order Fuzzy Time Series
Table 1. Historical Enrollments of the University of Alabama (columns: Year, Actual Enrollments)
[Step 1] Apply the proposed clustering algorithm to partition the universe of discourse (UoD) into intervals of different lengths.
[Sub-Step 1] Sort the numerical data in ascending order: 13055, 13563, 13867, 14696, 15145, 15163, 15311, 15433, 15460, 15497, 15603, 15861, 15984, 16388, 16807, 16859, 16919, 18150, 18876, 18970, 19328, 19337. Calculate the threshold for the stopping condition of the proposed clustering algorithm; the value used in Sub-Step 5 is 299.
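The threshold of 299 used in Sub-Step 5 can be reproduced under one assumption: that Eq. (1) is the average gap between adjacent sorted data, which for sorted data equals the range divided by n - 1. The slide text does not preserve the formula, so the following is a consistency check rather than the authors' stated definition.

```latex
\[
\text{threshold}
  = \frac{d_{22} - d_{1}}{22 - 1}
  = \frac{19337 - 13055}{21}
  = \frac{6282}{21}
  \approx 299.14
\]
```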
[Sub-Step 2] Put each datum into its own cluster, shown as follows: {13055}, {13563}, {13867}, {14696}, {15145}, {15163}, {15311}, {15433}, {15460}, {15497}, {15603}, {15861}, {15984}, {16388}, {16807}, {16859}, {16919}, {18150}, {18876}, {18970}, {19328}, {19337}.
[Sub-Step 3] Based on Eq. (2), calculate each cluster center cluster_center_k, 1 ≤ k ≤ 22, shown as follows:
cluster_center_1 = 13055, cluster_center_2 = 13563, cluster_center_3 = 13867, cluster_center_4 = 14696,
cluster_center_5 = 15145, cluster_center_6 = 15163, cluster_center_7 = 15311, cluster_center_8 = 15433,
cluster_center_9 = 15460, cluster_center_10 = 15497, cluster_center_11 = 15603, cluster_center_12 = 15861,
cluster_center_13 = 15984, cluster_center_14 = 16388, cluster_center_15 = 16807, cluster_center_16 = 16859,
cluster_center_17 = 16919, cluster_center_18 = 18150, cluster_center_19 = 18876, cluster_center_20 = 18970,
cluster_center_21 = 19328, cluster_center_22 = 19337.
Based on Eq. (3), calculate the distance distance_{m,m+1}, 1 ≤ m ≤ 21, shown as follows:
distance_{1,2} = 508, distance_{2,3} = 304, distance_{3,4} = 829, distance_{4,5} = 449, distance_{5,6} = 18, distance_{6,7} = 148, distance_{7,8} = 122,
distance_{8,9} = 27, distance_{9,10} = 37, distance_{10,11} = 106, distance_{11,12} = 258, distance_{12,13} = 123, distance_{13,14} = 404, distance_{14,15} = 419,
distance_{15,16} = 52, distance_{16,17} = 60, distance_{17,18} = 1231, distance_{18,19} = 726, distance_{19,20} = 94, distance_{20,21} = 358, distance_{21,22} = 9.
[Sub-Step 4] Find the smallest distance smallest_distance, i.e., 9 (the distance distance_{21,22} between cluster_center_21 and cluster_center_22).
[Sub-Step 5] Because smallest_distance is less than the threshold (i.e., 9 < 299), Cluster_21 (i.e., {19328}) and Cluster_22 (i.e., {19337}) are combined into one cluster (i.e., {19328, 19337}); go to Sub-Step 3.
Sub-Step 3 to Sub-Step 5 are repeated until smallest_distance is no longer less than the threshold. The final clustering results are shown as follows: {13055}, {13563}, {13867}, {14696}, {15145, 15163, 15311, 15433, 15460, 15497, 15603}, {15861, 15984}, {16388}, {16807, 16859, 16919}, {18150}, {18876, 18970}, {19328, 19337}.
[Sub-Step 6] Based on Eqs. (5) and (6), calculate the upper bound and the lower bound of each Cluster_k, 1 ≤ k ≤ 11. For example, the bound shared by Cluster_1 and Cluster_2 lies halfway between their centers: cluster_uBound_1 = cluster_lBound_2 = (13055 + 13563) / 2 = 13309.
Because there is no previous cluster before Cluster_1, the lower bound cluster_lBound_1 of Cluster_1 is calculated using Eq. (7), and because there is no next cluster after the last cluster, i.e., Cluster_11, the upper bound cluster_uBound_11 is calculated using Eq. (8).
[Sub-Step 7] Let each Cluster_k form an interval interval_k and calculate the middle value using Eq. (9).
Table 2. The Interval Generations from the Clusters (columns: Cluster, Data, Cluster Center, Lower Bound, Upper Bound, Middle Value)
Cluster 1: {13055}
Cluster 2: {13563}
Cluster 3: {13867}
Cluster 4: {14696}
Cluster 5: {15145, 15163, 15311, 15433, 15460, 15497, 15603}
Cluster 6: {15861, 15984}
Cluster 7: {16388}
Cluster 8: {16807, 16859, 16919}
Cluster 9: {18150}
Cluster 10: {18876, 18970}
Cluster 11: {19328, 19337}
For simplicity, the real values in Table 2 are rounded to integers, which gives the following intervals:
u_1 = [12801, 13309), u_2 = [13309, 13715), u_3 = [13715, 14282), u_4 = [14282, 15035),
u_5 = [15035, 15648), u_6 = [15648, 16155), u_7 = [16155, 16625), u_8 = [16625, 17506),
u_9 = [17506, 18537), u_10 = [18537, 19128), u_11 = [19128, 19537).
[Step 2] Define the linguistic terms A_1, A_2, …, and A_11 on the intervals u_1, u_2, …, and u_11, respectively.
[Step 3] Fuzzify each datum belonging to u_i into A_i, where 1 ≤ i ≤ 11.
Table 3. Fuzzified Enrollments of the University of Alabama
Year    Actual Enrollments    Fuzzified Enrollments
1971    13055    A_1
1972    13563    A_2
1973    13867    A_3
1974    14696    A_4
1975    15460    A_5
1976    15311    A_5
1977    15603    A_5
1978    15861    A_6
1979    16807    A_8
1980    16919    A_8
1981    16388    A_7
1982    15433    A_5
1983    15497    A_5
1984    15145    A_5
1985    15163    A_5
1986    15984    A_6
1987    16859    A_8
1988    18150    A_9
1989    18970    A_10
1990    19328    A_11
1991    19337    A_11
1992    18876    A_10
[Step 4] Obtain the fuzzy logical relationships (FLRs) of the first order fuzzy time series, and group the FLRs having the same current state into an FLR group (FLRG).
Table 4. FLRGs of the First Order Fuzzy Time Series
Group 1: A_1 → A_2
Group 2: A_2 → A_3
Group 3: A_3 → A_4
Group 4: A_4 → A_5
Group 5: A_5 → A_5 (5), A_6 (2)
Group 6: A_6 → A_8 (2)
Group 7: A_7 → A_5
Group 8: A_8 → A_7, A_8, A_9
Group 9: A_9 → A_10
Group 10: A_10 → A_11
Group 11: A_11 → A_10, A_11
[Step 5] Calculate the forecasted value. For example, the forecasted enrollment of year 1978 is calculated as follows. From Table 3, the fuzzified enrollment of year 1977 is A_5. From Table 4, the FLRG with current state A_5 is “A_5 → A_5 (5), A_6 (2)” in Group 5. Therefore, by Principle 2, the forecasted enrollment of year 1978 is calculated as follows:
$$\frac{5 \times m_5 + 2 \times m_6}{5 + 2},$$
where m_5 and m_6 are the middle values of the intervals u_5 and u_6, respectively.
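As a worked version of this example, the arithmetic below uses the rounded interval bounds and takes u_5 = [15035, 15648), so that u_5 ends where u_6 = [15648, 16155) begins. Both choices are assumptions of this sketch; using the unrounded bounds of Table 2 would shift the result slightly.

```latex
\[
m_5 = \frac{15035 + 15648}{2} = 15341.5, \qquad
m_6 = \frac{15648 + 16155}{2} = 15901.5
\]
\[
\text{forecast}_{1978}
  = \frac{5 \times 15341.5 + 2 \times 15901.5}{5 + 2}
  = \frac{108510.5}{7}
  = 15501.5
\]
```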
Table 6. An MSE Comparison of the Proposed Method Using the First Order Fuzzy Time Series With the Existing Methods (columns: Year; Actual Enrollments; Song and Chissom’s method; Sullivan and Woodall’s method; Chen’s method; Huarng’s method; the proposed method. Entries without a forecast are marked “Not forecasted”, and the last row gives the MSE of each method.)
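The MSE used for this comparison is the usual mean of squared differences between actual and forecasted values. A minimal sketch; the function name `mean_square_error` and the sample numbers are hypothetical and are not values from Table 6.

```python
def mean_square_error(actual, forecasted):
    """Mean of squared differences between paired actual and forecasted values."""
    if len(actual) != len(forecasted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - f) ** 2 for a, f in zip(actual, forecasted)) / len(actual)


# Hypothetical values, for illustration only.
actual = [100, 110, 120]
forecasted = [98, 113, 120]
mse = mean_square_error(actual, forecasted)
print(mse)
```

Years without a forecast would be left out of both lists before calling the function, so that every actual value is paired with a forecast.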
B. The Proposed Method Using the Second Order Fuzzy Time Series
The results of Steps 1-3 of the proposed method using the second order fuzzy time series are the same as those of Steps 1-3 using the first order fuzzy time series. In the following, we illustrate the results of Step 4 and Step 5 of the proposed method using the second order fuzzy time series.
[Step 4] Based on Table 3, construct the FLRs of the second order fuzzy time series, and group the FLRs having the same current state into an FLR group (FLRG).
Table 5. FLRGs of the Second Order Fuzzy Time Series
Group 1: A_1, A_2 → A_3
Group 2: A_2, A_3 → A_4
Group 3: A_3, A_4 → A_5
Group 4: A_4, A_5 → A_5
Group 5: A_5, A_5 → A_5 (3), A_6 (2)
Group 6: A_5, A_6 → A_8 (2)
Group 7: A_6, A_8 → A_8, A_9
Group 8: A_8, A_8 → A_7
Group 9: A_8, A_7 → A_5
Group 10: A_7, A_5 → A_5
Group 11: A_8, A_9 → A_10
Group 12: A_9, A_10 → A_11
Group 13: A_10, A_11 → A_11
Group 14: A_11, A_11 → A_10
[Step 5] Calculate the forecasted value. For example, the forecasted enrollment of year 1988 is calculated as follows. From Table 3, the fuzzified enrollments of years 1986 and 1987 are A_6 and A_8, respectively. From Table 5, the FLRG with current state “A_6, A_8” is “A_6, A_8 → A_8, A_9” in Group 7. Therefore, by Principle 2, the forecasted enrollment of year 1988 is calculated as follows:
$$\frac{1 \times m_8 + 1 \times m_9}{1 + 1},$$
where m_8 and m_9 are the middle values of the intervals u_8 and u_9, respectively. In the same way, the forecasted enrollments of the University of Alabama for the other years using the second order fuzzy time series can be obtained.
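A worked version of the 1988 example, assuming the rounded interval bounds u_8 = [16625, 17506) and u_9 = [17506, 18537); the unrounded bounds of Table 2 would give a slightly different figure.

```latex
\[
m_8 = \frac{16625 + 17506}{2} = 17065.5, \qquad
m_9 = \frac{17506 + 18537}{2} = 18021.5
\]
\[
\text{forecast}_{1988}
  = \frac{1 \times 17065.5 + 1 \times 18021.5}{1 + 1}
  = \frac{35087}{2}
  = 17543.5
\]
```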
Table 7. An MSE Comparison of the Proposed Method Using High-Order Fuzzy Time Series With the Existing Methods (columns: Order; Hwang, Chen, and Lee’s method; Chen’s method; the proposed method)
6. Conclusions
In this paper, we have presented a new method to forecast the enrollments of the University of Alabama using the first order fuzzy time series and the high-order fuzzy time series, respectively. The proposed method uses the proposed clustering algorithm to partition the universe of discourse into intervals of different lengths. The proposed method achieves higher average forecasting accuracy than the existing methods, in the sense that it yields smaller mean square errors (MSEs).