PCFG Based Synthetic Mobility Trace Generation S. C. Geyik, E. Bulut, and B. K. Szymanski Department of Computer Science, Rensselaer Polytechnic Institute Methodology Probabilistic Context Free Grammars (PCFG) Context free grammars with probabilities assigned to productions Ex: START a START (0.4) | a (0.6) Probability of “ a a” (P(START a START) x P(START a)) = 0.4 x 0.6 = 0.24 Modeling behavior sequences by constructing PCFGs from training data Only positive data is enough for Probabilistic CFGs ! Automated Construction of PCFGs The algorithm is introduced in [1]. Two stages in Construction: (i) Data Incorporation: All sentences are introduced to the initial grammar as rules of the START non-terminal. One non-terminal for each terminal with probability 1.0. Chunking: Generate a new non-terminal for a string and replace all occurrences of this string with the non-terminal, updating its frequency Merging: Combine two non-terminals and replace all the occurrences of each non-terminal with this non-terminal (Generalization) Goodness of grammar: Bayesian posterior probability P(G|D) of the grammar G given the training data D is defined as: P(G|D) = P(G) x P(D|G) P(D) P(G) is related to description length and P(D|G) is related to sentence probabilities of training data. Time complexity of O(D 2 log(D)) in (1). (ii) Application of Operators: Chunk and Merge operators applied at each step to generalize and miniaturize the grammar. Motivation Need for long-term mobility data for initial testing of protocols, before actual deployment It is difficult to collect realistic long-term traces A method is needed to increase the amount of the collected mobility traces, keeping the characteristics Contributions A Probabilistic Context Free Grammar (PCFG) based method to generate synthetic mobility traces A PCFG is learned from a real mobility trace where each sentence that can be produced by this PCFG represents a movement sequence in this real trace We provide extensions to the PCFG model to capture spatial and temporal mobility characteristics Once a PCFG is constructed, many sentences can be produced, dictating the movements of an entity, hence providing the synthetic mobility data Extensions to the PCFG Model Spatial mobility properties are modeled by representing each location with a terminal symbol in the PCFG Example: a node starts at location l A, then goes to l B after 40 seconds and to l C after 25 seconds : l A, 40 l B 25 l C If t represents 25 seconds, this movement sequence can be integrated into the grammar training as: l A t t l B t l C !!! Notice the approximation of 40 seconds with two time tokens !!! Trade-off between the time interval of the time token (resolution) and the complexity of the grammar Terminal mobility properties are modeled by the time tokens (t) representing a preset amount of time between the location symbols (please note that an alternative is to have the time token represent a distribution and always use a single time token, but this is left for future work). Trace Generation Algorithm The algorithm creates a sentence from the constructed grammar which represents a movement sequence for a single mobile entity. Once the sequence is over, another sentence is generated. The sequences should be filtered according to the current location of the entity. This research was sponsored by US Army Research laboratory and the UK Ministry of Defence and was accomplished under Agreement Number W911NF and under Cooperative Agreement Number W911NF The views and conclusions contained in this document are those of the authors, and should not be interpreted as representing the official policies, either expressed or implied, of the US Army Research Laboratory, the U.S. Government, the UK Ministry of Defense, or the UK Government. The US and UK Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. Acknowledgment [1] Geyik, S. C., Szymanski, B., Event Recognition in Sensor Networks by Means of Grammatical Inference, IEEE INFOCOM 2009, Rio de Janeiro, Brazil, March 2009, Page(s): [2] Zhang, X., Kurose, J., Levine, B., N., Towsley, D., Zhang, H., Study of a bus-based disruption tolerant network: Mobility modeling and impact on routing, In Proc. ACM Annual Intl. Conf. on Mobile Computing and Networking (Mobicom), Page(s): , [3] Piorkowski, M., Sarafijanovoc-Djukic, N., Grossglauser, M., A Parsimonious Model of Mobile Partitioned Networks with Clustering, The First International Conference on COMmunication Systems and NETworkS (COMSNETS), 2009, Bangalore, India. References A PCFG-based method is provided to generate long-term synthetic mobility traces, crucial for protocol testing. The method performs better than Markov models in closely modeling the mobility characteristics of the real world traces. Future work includes future movement prediction for mobile entities and application of the PCFG-based method to other domains. Conclusions and Future Work Evaluations SF Cab Mobility Trace [3] Taxi movements in San Francisco Evaluated based on how much the synthetic traces coincide with actual traces (i.e. how much time passes between location change or how accurate the next location is) We consider two metrics. One is how accurate are the synthetic traces in simulating the next movement (next location or next mobile entity encountered) given the previous k movements (e.g. Cons 2 means the next movement given just the previous movement (2-1=1 history)). The other one is how accurate are the synthetic traces in simulating the time it takes for the next movement given the previous k movements (e.g. Intern 2 is analogous to Cons 2). We used Euclidean distances and used weighting to calculate the scores We compared the closeness of the PCFG and 2-Level Markov Model generated traces to the actual trace Previous work suggests goodness of Markov Models on prediction and modeling, furthermore shows that more than two levels does not perform better. Time complexity of building a 2-Level Markov Model is O(D log(D)) where D is the size of the trace. This can be calculated by taking the worst-case number of entries in the state table to be D and updating the statistics by each transition in the trace to be log(D) to find the necessary entry. Bus to bus meeting data collected in Amherst, MA. Evaluated based on how much time is between consecutive meetings and how accurate is the set of buses met by a bus in a certain route. DieselNet Trace [2]