1
Finite State Transducers
Mark Stamp
2
Finite State Automata
- FSA: states and transitions
- Represented as a labeled directed graph; an FSA has one label per edge
- States are circles; double circles mark end states
- The beginning state is denoted by an arrowhead (or, sometimes, a bold circle is used)
3
FSA Example
- Nodes are states
- Transitions are (labeled) arrows
- For example…
[Figure: an FSA with states 1, 2, 3 and edges labeled a, c, y, z; a sketch of one possible encoding follows below]
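A minimal sketch of how an FSA like this might be encoded in Python; the exact edge endpoints and the choice of accepting state are assumptions, since they are not fully recoverable from the figure.

```python
# A minimal FSA sketch: states are ints, transitions a dict keyed by
# (state, symbol). The edge placement and the choice of state 3 as
# accepting are assumptions made for illustration.
transitions = {
    (1, "a"): 2,
    (2, "c"): 3,
    (3, "z"): 2,   # hypothetical placement of the z and y edges
    (2, "y"): 1,
}
accepting = {3}

def accepts(string, start=1):
    """Return True if the FSA ends in an accepting state on `string`."""
    state = start
    for symbol in string:
        key = (state, symbol)
        if key not in transitions:
            return False   # no edge for this symbol: reject
        state = transitions[key]
    return state in accepting

print(accepts("ac"))   # True with the assumed edges
```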
4
Finite State Transducer
- FST: input and output labels on each edge; that is, 2 labels per edge
- There can be more labels (e.g., edge weights)
- Recall, an FSA has one label per edge
- FST is represented as a directed graph, using the same symbols as for an FSA
- FSTs may be useful in malware analysis…
5
Finite State Transducer
- FST has input and output "tapes"
- Transducer, i.e., it can map input to output
- Often viewed as a "translating" machine, but somewhat more general
- An FST is a finite automaton with output; the usual finite automaton only has input
- Used in natural language processing (NLP), and in many other applications
6
FST Graphically
- Edges/transitions are (labeled) arrows of the form i : o, that is, input:output
- Nodes are labeled numerically
- For example…
[Figure: an FST with states 1, 2, 3 and edges labeled a:b, c:d, z:x, y:q]
7
FST Modes
- FST is usually viewed as a translating machine
- But an FST can operate in several modes:
- Generation
- Recognition
- Translation (left-to-right or right-to-left)
- Examples of the modes are considered next…
8
FST Modes
Consider this simple example: a single state 1 with a self-loop labeled a:b
- Generation mode: write an equal number of a's and b's to the first and second tape, respectively
- Recognition mode: "accept" when the 1st tape has the same number of a's as the 2nd tape has b's (a sketch follows below)
- Translation mode: next slide
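A minimal sketch of recognition mode for this one-state machine, assuming the tapes are given as strings; each pass through the a:b self-loop consumes one symbol from each tape.

```python
def recognize(tape1, tape2):
    """Recognition mode for the single-state a:b FST: repeatedly take
    the a:b self-loop, consuming an 'a' from tape 1 and a 'b' from
    tape 2; accept iff both tapes run out together. This forces equal
    counts of a's and b's, as the slide describes."""
    return len(tape1) == len(tape2) and \
        all(x == "a" and y == "b" for x, y in zip(tape1, tape2))

print(recognize("aaa", "bbb"))  # True
print(recognize("aa", "bbb"))   # False
```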
9
FST Modes
Consider the same example: a single state 1 with a self-loop labeled a:b
- Translation mode, left-to-right: for every a read from the 1st tape, write b to the 2nd tape
- Translation mode, right-to-left: for every b read from the 2nd tape, write a to the 1st tape
- Translation is the mode we usually want to consider (a sketch follows below)
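A sketch of both translation directions for the same machine; the string encoding of tapes and the minimal input validation are assumptions.

```python
def translate_lr(tape1):
    """Left-to-right: for every a read from the 1st tape, write b."""
    assert set(tape1) <= {"a"}, "this FST only reads a's"
    return "b" * len(tape1)

def translate_rl(tape2):
    """Right-to-left: for every b read from the 2nd tape, write a."""
    assert set(tape2) <= {"b"}, "this FST only reads b's"
    return "a" * len(tape2)

print(translate_lr("aaa"))  # bbb
print(translate_rl("bb"))   # aa
```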
10
WFST
- WFST == Weighted FST
- Include a "weight" on each edge; that is, edges of the form i : o / w
- Often, probabilities serve as weights…
[Figure: a WFST with states 1, 2, 3 and edges labeled a:b/1, c:d/0.6, z:x/0.4, y:q/1]
11
FST Example
Homework…
12
Operations on FSTs
- Many well-defined operations on FSTs: union, intersection, composition, etc.
- These also apply to WFSTs
- Composition is especially interesting
- In a malware context, we might want to compose detectors for the same family, or for different families
- Why might this be useful?
13
FST Composition
- Compose 2 FSTs (or WFSTs)
- Suppose the 1st WFST has nodes 1, 2, …, n and the 2nd WFST has nodes 1, 2, …, m
- Possible nodes in the composition are labeled (i, j), for i = 1, 2, …, n and j = 1, 2, …, m
- Generally, not all of these will appear
- There is an edge from (i1, j1) to (i2, j2) only when the composed labels "match" (next slide…)
14
FST Composition
Suppose we have the following labels:
- In the 1st WFST, the edge from i1 to i2 is x:y/p
- In the 2nd WFST, the edge from j1 to j2 is w:z/q
- Consider nodes (i1, j1) and (i2, j2) in the composed WFST
- There is an edge between these nodes provided y == w, i.e., the output from the 1st matches the input to the 2nd
- The resulting edge label is x:z/pq (a sketch follows below)
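A sketch of the matching rule just described, assuming each WFST is given as a list of (source, input, output, weight, destination) edges; this is the naive product construction, with no optimization or epsilon handling.

```python
from itertools import product

def compose(edges1, edges2):
    """Compose two WFSTs given as (src, inp, out, weight, dst) edge
    lists. An edge from (i1, j1) to (i2, j2) exists whenever the 1st
    machine's output symbol y equals the 2nd machine's input symbol w;
    the new label is x:z and the new weight is the product p*q."""
    composed = []
    for (i1, x, y, p, i2), (j1, w, z, q, j2) in product(edges1, edges2):
        if y == w:  # output of 1st edge matches input of 2nd edge
            composed.append(((i1, j1), x, z, p * q, (i2, j2)))
    return composed

# Tiny made-up example: a:b/0.5 composed with b:a/0.4 gives a:a/0.2
print(compose([(1, "a", "b", 0.5, 2)], [(1, "b", "a", 0.4, 2)]))
```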
15
WFST Composition
Consider the composition of the following two WFSTs…
[Figure: two WFSTs, each with states 1-4; the first has edges labeled a:b/0.5, a:a/0.6, b:b/0.3, a:b/0.1, b:b/0.4, and a:b/0.2; the second has edges labeled b:a/0.5, a:b/0.3, b:b/0.1, a:b/0.4, and b:a/0.2]
16
WFST Composition Example
[Figure: the two WFSTs from the previous slide and their composition, which has states (1,1), (1,2), (2,2), (3,2), (4,2), (4,3), and (4,4), and edges labeled a:b/.01, a:a/.04, a:b/.24, a:a/.02, a:a/.1, b:a/.08, b:a/.06, and a:b/.18]
More details and algorithms can be found in the references
17
WFST Composition
- In the previous example, the composition is as shown on the prior slide
- But the (4,3) node is useless, since a path must always end in a final state (a trimming sketch follows below)
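A sketch of how states like (4,3) can be pruned: keep only the states from which some final state is reachable. The (src, dst) edge encoding is an assumption, consistent with the earlier composition sketch.

```python
def coaccessible(edges, finals):
    """Given (src, dst) edge pairs, return the set of states from
    which a final state can be reached; anything outside this set,
    like the (4,3) node above, is useless and can be removed."""
    rev = {}
    for src, dst in edges:
        rev.setdefault(dst, set()).add(src)
    keep, stack = set(finals), list(finals)
    while stack:
        node = stack.pop()
        for pred in rev.get(node, ()):   # walk the edges backwards
            if pred not in keep:
                keep.add(pred)
                stack.append(pred)
    return keep
```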
18
FST Approximation of HMM
Why would we want to approximate an HMM by an FST?
- Faster scoring using the FST
- Easier to correct misclassifications in an FST
- Possible to compose FSTs
- Most important, it's really cool and fun…
Downside? The FST may be less accurate than the HMM
19
FST Approximation of HMM
How to approximate an HMM by an FST? We consider 2 methods, known as:
- n-type approximation
- s-type approximation
These usually focus on "problem 2", that is, uncovering the hidden states
This is the usual concern in NLP, such as "part of speech" tagging
Note: the "n-type" and "s-type" terminology comes from the paper Finite state transducers approximating hidden Markov models, by A. Kempe; it does not seem to be in standard use
20
n-type Approximation
- Let V be the distinct observations in the HMM
- Let λ = (A, B, π) be a trained HMM; recall, A is N x N, B is N x M, and π is 1 x N
- Let (input : output / weight) = (Vi : Sj / p), where i ∈ {1, 2, …, M} and j ∈ {1, 2, …, N}
- The Sj are hidden states (rows of B), and the weight is the max probability (from λ)
- Examples later…
21
More n-type Approximations
Range of n-type approximations:
- n0-type: only use the B matrix
- n1-type: see the previous slide
- n2-type: for a 2nd order HMM
- n3-type: for a 3rd order HMM, and so on
What is a 2nd order HMM? Transitions depend on 2 consecutive states; in 1st order, they depend only on the previous state
22
s-type Approximation
- "Sentence type" approximation: use sequences and/or natural breaks
- In n-type, we take the max probability over one transition, using the A and B matrices
- In s-type, we consider all sequences up to some length
- Ideally, break at boundaries of some sort; in NLP, a sentence is such a boundary
- For malware, it is not so clear where to break, so maybe just use a fixed length
23
HMM to FST
- An exact representation is also possible; that is, the resulting FST is the "same" as the HMM
- Given the model λ = (A, B, π):
- Nodes for each (input : output) = (Vi : Sj)
- Edge from each node to all other nodes, including a loop to the same node
- Edges labeled with the target node
- Weights computed from the probabilities in λ (a sketch follows below)
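A sketch of the exact construction, assuming one node per (Vi : Sj) pair plus an initial node. The weight convention used here (the transition probability into the target node's state times that state's emission probability, or π times emission from the start) is one natural reading of "weights computed from probabilities in λ", not necessarily the exact convention intended by the slides.

```python
import numpy as np

def hmm_to_fst(A, B, pi, obs_syms, state_syms):
    """Build the exact HMM-as-FST: one node per (observation : state)
    pair plus an initial node 0, and an edge from every node to every
    node, labeled with the *target* node's input:output pair. Edge
    weight = (transition prob into the target state) * (target state's
    emission prob) -- an assumed convention; see the text above."""
    A, B, pi = np.asarray(A), np.asarray(B), np.asarray(pi)
    nodes = [(v, s) for v in range(B.shape[1]) for s in range(B.shape[0])]
    edges = []
    for v, s in nodes:  # edges out of the initial node use pi
        edges.append((0, obs_syms[v], state_syms[s], pi[s] * B[s, v], (v, s)))
    for (v1, s1) in nodes:  # edges between (obs, state) nodes use A
        for (v2, s2) in nodes:
            edges.append(((v1, s1), obs_syms[v2], state_syms[s2],
                          A[s1, s2] * B[s2, v2], (v2, s2)))
    return edges
```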
24
HMM to FST
- Note that some probabilities may be 0; remove edges with 0 probabilities
- A lot of probabilities may be small, so maybe approximate by removing edges with "small" probabilities? (a sketch follows below)
- Could be an interesting experiment…
- A reasonable way to approximate an HMM that does not seem to have been studied
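A one-line sketch of the suggested experiment, using the same (src, inp, out, weight, dst) edge tuples as in the sketches above; the threshold is the knob to vary.

```python
def prune(edges, threshold=0.0):
    """Drop edges whose weight is 0, or below a chosen small
    threshold; threshold=0.0 removes only impossible transitions."""
    return [e for e in edges if e[3] > threshold]
```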
25
HMM Example
- Suppose we have 2 coins: 1 coin is fair and 1 is unfair
- Roll a die to decide which coin to flip
- We see the resulting sequence of H and T
- We do not know which coin was flipped, and we do not see the roll of the die
- Observations? Hidden states?
26
HMM Example
Suppose the probabilities are as given; then what is λ = (A, B, π)?
[Figure: hidden states fair and unfair, with transitions fair→fair 0.9, fair→unfair 0.1, unfair→fair 0.8, unfair→unfair 0.2; observations H and T, emitted with probabilities 0.5/0.5 from the fair coin and 0.7/0.3 from the unfair coin]
27
HMM Example
HMM is given by λ = (A, B, π), where (rows are states F, U; columns of B are observations H, T)
A = |0.9 0.1|   B = |0.5 0.5|   π = |1.0 0.0|
    |0.8 0.2|       |0.7 0.3|
- This π implies we start in the F (fair) state
- Also, state 1 is F and state 2 is U (unfair)
- Suppose we observe HHTHT; then the probability of, say, FUFFU is
πF bF(H) aFU bU(H) aUF bF(T) aFF bF(H) aFU bU(T) = 1.0(0.5)(0.1)(0.7)(0.8)(0.5)(0.9)(0.5)(0.1)(0.3) = 0.000189
28
HMM Example
We have the same A, B, and π as above, and observe HHTHT
The probabilities are in the table:
[Table: all 16 state sequences FFFFF, FFFFU, …, FUUUU that start in state F, with the score and probability of each; the numeric values appear in the slide image. A sketch that recomputes them follows below]
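The table's values can be reproduced by brute force from λ: enumerate every length-5 state sequence, multiply the corresponding probabilities, and print each row. A sketch, using the model as reconstructed on the earlier slides (the 0/1 state encoding is ours):

```python
from itertools import product

# The 2-coin model from the earlier slides: state 0 = F, state 1 = U.
A  = [[0.9, 0.1], [0.8, 0.2]]            # transition probabilities
B  = {"H": [0.5, 0.7], "T": [0.5, 0.3]}  # emission probabilities
pi = [1.0, 0.0]                          # always start in the fair state
obs = "HHTHT"

total = 0.0
for states in product([0, 1], repeat=len(obs)):
    p = pi[states[0]] * B[obs[0]][states[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[obs[t]][states[t]]
    name = "".join("FU"[s] for s in states)
    if p > 0:
        print(name, p)       # one row of the table, e.g. FUFFU 0.000189
    total += p
print("P(HHTHT) =", total)   # sum over all state sequences
```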
29
HMM Example
- So, the most likely state sequence is FFFFF; this solves problem 2
- Problem 1, scoring? Next slide
- Problem 3? Not relevant here
(table repeated from the previous slide)
30
HMM Example
- How to score the sequence HHTHT? Sum over all state sequences
- Sum the "score" column in the table: P(HHTHT) ≈ 0.0309
- The forward algorithm is way more efficient (a sketch follows below)
(table repeated from the previous slide)
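A sketch of the forward algorithm for the same model; it computes the same P(HHTHT) in O(N²T) time instead of enumerating all N^T state sequences, and agrees with the brute-force sum above.

```python
import numpy as np

def forward_score(A, B, pi, obs):
    """Forward algorithm: alpha[j] accumulates the probability of the
    observations so far, ending in state j; the final score is the
    sum over states."""
    A, B, pi = np.asarray(A), np.asarray(B), np.asarray(pi)
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

A  = np.array([[0.9, 0.1], [0.8, 0.2]])
B  = np.array([[0.5, 0.5], [0.7, 0.3]])   # rows F, U; columns H, T
pi = np.array([1.0, 0.0])
obs = [0, 0, 1, 0, 1]                     # HHTHT, with H = 0 and T = 1
print(forward_score(A, B, pi, obs))       # about 0.0309
```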
31
n-type Approximation
Consider the 2-coin HMM, with A, B, and π as above
- For each observation, only include the most probable hidden state
- So, the only possible FST labels in this case are H:F/w1, H:U/w2, T:F/w3, T:U/w4
- The weights wi are probabilities
32
n-type Approximation
Consider the example: for each observation, choose the most probable state; the weight is the probability
[Figure: n-type FST with states 1, 2, 3; from the initial state, edges labeled H:F/0.5 and T:F/0.5; the remaining edges are labeled H:F/0.45 and T:F/0.45. A construction sketch follows below]
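A sketch of the n1-type construction for this model: for each (previous state, observation) pair, keep only the most probable target state, scored by transition-times-emission (π-times-emission from the start). The output includes the figure's weights of 0.5 and 0.45; the list-of-tuples output format is an assumption.

```python
import numpy as np

def n1_labels(A, B, pi, obs_syms, state_syms):
    """For each (previous state, observation) pair, keep only the
    single most probable target state; the weight is that max
    probability. From the start, use pi instead of a row of A."""
    A, B, pi = np.asarray(A), np.asarray(B), np.asarray(pi)
    labels = []
    for v, sym in enumerate(obs_syms):
        j = int(np.argmax(pi * B[:, v]))   # best first state
        labels.append(("start", f"{sym}:{state_syms[j]}", pi[j] * B[j, v]))
        for i in range(A.shape[0]):        # best state after state i
            j = int(np.argmax(A[i] * B[:, v]))
            labels.append((state_syms[i], f"{sym}:{state_syms[j]}",
                           A[i, j] * B[j, v]))
    return labels

A  = [[0.9, 0.1], [0.8, 0.2]]
B  = [[0.5, 0.5], [0.7, 0.3]]
pi = [1.0, 0.0]
for row in n1_labels(A, B, pi, "HT", "FU"):
    print(row)   # e.g. ('start', 'H:F', 0.5) and ('F', 'H:F', 0.45)
```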
33
n-type Approximation
Suppose instead that B and π are as shown in the slide image; what is the most probable state for each observation? The weight is the probability
[Figure: n-type FST with states 1, 2, 3, 4 and edges labeled H:U/0.42, H:U/0.35, T:F/0.20, T:F/0.30, H:F/0.30, T:F/0.25, T:F/0.30, and H:F/0.30]
34
HMM as FST
Consider the 2-coin HMM, with A, B, and π as shown; then the FST nodes correspond to…
- The initial state
- Heads from the fair coin (H:F)
- Tails from the fair coin (T:F)
- Heads from the unfair coin (H:U)
- Tails from the unfair coin (T:U)
35
HMM as FST
Suppose the HMM is specified by A, B, and π as shown; then the FST is…
[Figure: exact FST with initial state 1 and states 2-5 corresponding to H:F, T:F, H:U, and T:U; each state has an edge to every one of states 2-5, labeled with the target node's input:output pair]
36
HMM as FST
- This FST is boring and not very useful
- Weights make it a little more interesting
- Computing the weights is homework…
(figure repeated from the previous slide)
37
Why Consider FSTs?
- FST used as a "translating machine"
- Well-defined operations on FSTs; composition is an interesting example
- Can convert an HMM to an FST, either exactly or as an approximation
- Approximations may be much simplified, but might not be as accurate
- Advantages of FST over HMM?
38
Why Consider FSTs?
- Scoring/translating is faster with an FST
- Able to compose multiple FSTs, where the FSTs may be derived from HMMs
- One idea: train multiple HMMs on malware (same family and/or different families), convert each HMM to an FST, and compose the resulting FSTs
39
Bottom Line
Can we get the best of both worlds?
- Fast scoring and composition with FSTs
- Simplify/approximate HMMs via FSTs
- Tweak the FST to improve scoring
- Efficient training using HMMs
Other possibilities?
- Directly compute an FST without an HMM
- Or use an FST as a first pass (e.g., disassembly?)
40
References
- A. Kempe, Finite state transducers approximating hidden Markov models
- J. R. Novak, Weighted finite state transducers: Important algorithms
- K. Striegnitz, Finite state transducers