root/branches/raggi/lib/protocols/buftok.rb

Revision 668, 5.2 kB (checked in by blackhedd, 2 years ago)

migrated version_0 to trunk

# BufferedTokenizer - Statefully split input data by a specifiable token
#
# Authors:: Tony Arcieri, Martin Emde
#
#----------------------------------------------------------------------------
#
# Copyright (C) 2006-07 by Tony Arcieri and Martin Emde
#
# Distributed under the Ruby license (http://www.ruby-lang.org/en/LICENSE.txt)
#
#---------------------------------------------------------------------------
#

# BufferedTokenizer takes a delimiter upon instantiation, or acts line-based
# by default.  It allows input to be spoon-fed from some outside source that
# receives arbitrary-length datagrams which may or may not contain the token
# by which entities are delimited.

class BufferedTokenizer
  # New BufferedTokenizers will operate on lines delimited by "\n" by default,
  # or allow you to specify any delimiter token you so choose, which will then
  # be used by String#split to tokenize the input data.
  def initialize(delimiter = "\n", size_limit = nil)
    # Store the specified delimiter
    @delimiter = delimiter

    # Store the specified size limitation
    @size_limit = size_limit

    # The input buffer is stored as an array.  This is by far the most efficient
    # approach given language constraints (in C a linked list would be a more
    # appropriate data structure).  Segments of input data are stored in a list
    # which is only joined when a token is reached, substantially reducing the
    # number of objects required for the operation.
    @input = []

    # Size of the input buffer
    @input_size = 0
  end

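The array-buffer strategy described above can be seen in miniature below. This is an illustrative sketch, not part of the original file: each `<<` merely appends a reference, and only the final `join` allocates one combined String per delimited entity.

```ruby
# Sketch: why buffering segments in an Array is cheap.
segments = []
segments << "GET /inde"
segments << "x.html HT"
segments << "TP/1.0"

# One allocation for the whole entity, performed only once a token is hit.
line = segments.join
puts line           # => "GET /index.html HTTP/1.0"

# After the join, the individual fragments can be garbage collected.
segments.clear
```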
  # Extract takes an arbitrary string of input data and returns an array of
  # tokenized entities, provided there were any available to extract.  This
  # makes for easy processing of datagrams using a pattern like:
  #
  #   tokenizer.extract(data).map { |entity| Decode(entity) }.each do ...
  def extract(data)
    # Extract token-delimited entities from the input string with the split method.
    # There's a bit of craftiness here with the -1 parameter.  Normally split would
    # behave no differently regardless of whether the token lies at the very end of
    # the input buffer or not (i.e. a literal edge case).  Specifying -1 forces split
    # to return "" in this case, meaning that the last entry in the list represents a
    # new segment of data where the token has not been encountered.
    entities = data.split @delimiter, -1

    # Check to see if the buffer has exceeded capacity, if we're imposing a limit
    if @size_limit
      raise 'input buffer full' if @input_size + entities.first.size > @size_limit
      @input_size += entities.first.size
    end

    # Move the first entry in the resulting array into the input buffer.  It represents
    # the last segment of a token-delimited entity unless it's the only entry in the list.
    @input << entities.shift

    # If the resulting array from the split is empty, the token was not encountered
    # (not even at the end of the buffer).  Since we've encountered no token-delimited
    # entities this go-around, return an empty array.
    return [] if entities.empty?

    # At this point, we've hit a token, or potentially multiple tokens.  Now we can bring
    # together all the data we've buffered from earlier calls without hitting a token,
    # and add it to our list of discovered entities.
    entities.unshift @input.join

=begin
    # Note added by FC, 10Jul07. This paragraph contains a regression. It breaks
    # empty tokens. Think of the empty line that delimits an HTTP header. It will have
    # two "\n" delimiters in a row, and this code mishandles the resulting empty token.
    # If someone figures out how to fix the problem, we can re-enable this code branch.

    # Multi-character token support.
    # Split any tokens that were incomplete on the last iteration but complete now.
    entities.map! do |e|
      e.split @delimiter, -1
    end
    # Flatten the resulting array.  This has the side effect of removing the empty
    # entry at the end that was produced by passing -1 to split.  Add it again if
    # necessary.
    if (entities[-1] == [])
      entities.flatten! << []
    else
      entities.flatten!
    end
=end

    # Now that we've hit a token, joined the input buffer and added it to the entities
    # list, we can go ahead and clear the input buffer.  All of the segments that were
    # stored before the join can now be garbage collected.
    @input.clear

    # The last entity in the list is not token-delimited, however, thanks to the -1
    # passed to split.  It represents the beginning of a new list of as-yet-untokenized
    # data, so we add it to the start of the list.
    @input << entities.pop

    # Set the new input buffer size, provided we're keeping track
    @input_size = @input.first.size if @size_limit

    # Now we're left with the list of extracted token-delimited entities we wanted
    # in the first place.  Hooray!
    entities
  end
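The -1 limit passed to String#split, which the method above depends on, can be demonstrated in isolation. This is an illustrative sketch, not part of the original file:

```ruby
# With -1, split preserves trailing empty strings, so we can tell
# whether the data ended exactly on a delimiter.
with_limit    = "foo\nbar\n".split("\n", -1)
without_limit = "foo\nbar\n".split("\n")
p with_limit     # => ["foo", "bar", ""]
p without_limit  # => ["foo", "bar"]

# A datagram containing no delimiter at all splits into a single element,
# which extract shifts into the buffer before returning [].
p "partial".split("\n", -1)  # => ["partial"]
```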

  # Flush the contents of the input buffer, i.e. return the input buffer even though
  # a token has not yet been encountered.
  def flush
    buffer = @input.join
    @input.clear
    @input_size = 0 if @size_limit  # reset the tracked size along with the buffer
    buffer
  end

  def empty?
    @input.empty?
  end
end
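To make the spoon-feeding pattern concrete, here is a usage sketch. It embeds a condensed copy of the tokenizer (same logic as the class above, with size-limit tracking omitted for brevity) so the example is self-contained; the HTTP-flavored input strings are purely illustrative:

```ruby
# Condensed copy of BufferedTokenizer so this sketch runs standalone.
class MiniBufferedTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @input = []
  end

  def extract(data)
    entities = data.split(@delimiter, -1)
    @input << entities.shift            # buffer the possibly-partial head
    return [] if entities.empty?        # no delimiter seen in this datagram
    entities.unshift @input.join        # complete the first entity
    @input.clear
    @input << entities.pop              # re-buffer the trailing partial segment
    entities
  end

  def flush
    buffer = @input.join
    @input.clear
    buffer
  end
end

tokenizer = MiniBufferedTokenizer.new
p tokenizer.extract("GET / HT")        # => [] (no "\n" seen yet)
p tokenizer.extract("TP/1.0\nHost: ")  # => ["GET / HTTP/1.0"]
p tokenizer.extract("example\n\n")     # => ["Host: example", ""]
tokenizer.extract("trailing")
p tokenizer.flush                      # => "trailing"
```

Note the empty string in the third result: two consecutive delimiters yield an empty token, which is exactly the case (the blank line ending an HTTP header) that the disabled multi-character-token branch above mishandles.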