1 The OLSR.ORG story
3 Proactive protocols (Link State Routing Protocols) generate a lot of
4 overhead because they have to keep topoloy information and routing tables in
5 sync amongst all or at least amongst adjacent nodes. If the protocol does
6 not manage to keep the routing tables synced it is likely that the payload
7 will spin in routing loops until the TimeToLive (TTL) is expired. Apart from
8 high traffic-overhead and CPU-Load this is the biggest issue for Link State
9 Routing Protocols.
11 We were actively involved in the evolution of olsrd from olsr.org. Actually
12 we were the people that made it functional. RFC3626 - the initial IETF-draft
13 of olsr - does not work in real life. If you want to find out what it's
14 developers intended it to be and how it should work, I would like to suggest
15 reading the RFC3626 after you have seen the presentation of Andreas
16 Tøennesen on the OLSR.ORG website about RFC3626.
19 We heavily modified olsr over the time. We disabled almost everything that
20 the inital designers of olsr thought was smart and replaced it with the
21 LQ/ETX-Mechanism and Fish-Eye Mechanism tp update topology information.
25 What we did to improve olsr (in historical order):
27 * Test OLSR according to RFC3626 at the conference Wizards of OS III in 2004
28 - Meshcloud with 25 Nodes
32 Routing tables take long time to build and no time to break down.
33 Routes flap.
34 Routing loops.
35 No throughput.
36 Gateway switches all the time - so stateful applications will brake
37 down all the time
43 Hysteresis mechanism frequently kicks Multi Point Relays (MPRs) out
44 of the routing table --> Infrastructure to broadcast topology information
45 breaks down all the time and MPRs have to be negotiated again...
47 Multipoint relay selection selects nodes far away to keep the number of
48 necessary Multi Point Relays low --> Links to MPRs are weak, so hysteresis
49 kicks them out of the routing table more often than not
51 Multipoint relay selection reduces protocol overhead and prevents topology
52 information from being in sync --> Routing loops
54 Routes are unstable --> No throughput
56 Routes selected on minimum Hop-count maximises packetloss --> No throughput
58 Routing loops --> No throughput
60 Dynamic gateway selection --> Stateful connections get interrupted when a
61 different gateway is selected
65 What we did:
66 Disable hysteresis.
67 Disable MPRs - all nodes forward topology information.
72 Now almost everything that was meant to optimize Link State Routing was
73 disabled - a simple proactive link-state routing protocol with support for
74 multiple interfaces was all that was left. We started to deploy OLSR in the
75 Freifunk Mesh in Berlin - rather we should have named it LSR back then. But
76 since the implementation came from olsr.org and everything could be switched
77 on and off by the configuration file we didn't think about starting a new
78 project and renaming it. This became later a source of confusion and
79 disappointment for all people that tried olsr.org and had no idea what was
80 going on in Berlin. If you use the standard configuration file that is
81 shipped with olsr.org, olsrd will still behave according to RFC3626. So if
82 you want to see how miserable RFC3626 works - try it with the default
83 configuration file.
88 * Deployment of OLSR (with 'Optimizations' removed) in the Berlin Freifunk
89 mesh cloud - 2004
92 Works much better than RFC3626. Still it was hardly usable.
93 Throughput very low and unstable.
94 Routing table doesn't break down anymore
95 Dynamic gateway selection --> Stateful connections get interrupted
96 when a different gateway is selected
100 We knew routes based on minimum hopcount will likely have very
101 low throughput.
102 Dynamic gateway selection is a tradeoff of automatic gateway
103 selection by the protocol
107 I knew from my first experience with mobilemesh (another Link State Routing
108 Protocol that we tried at the beginning of the Freifunk Mesh) that minimum
109 hop count sucks completely as an algorithm to select routes. So I started to
110 think about routes that are chosen based on metrics measuring the
111 quality/throughput of links. I decided to use a metric based on packet loss
112 and found the idea of ETX (Expected Transmission Count) in a paper written
113 at the MIT. I didn't like the way they suggested to implement it (sending
114 extra UDP packets), so I developed the idea of ETX/LQ together with Thomas
115 Lopatic who implemented the new ideas in olsrd. Rather than sending extra
116 UDP-packets we could just keep track of missed or successfully transmitted
117 Hello-Packets which are frequently broadcasted by the LSR-mechanism anyway.
118 And we could send the information about the successfully received packets
119 within the Hello messages - so neighbors are updated immediately with every
120 "Hello" broadcast how their neighbor thinks about their own transmissions.
122 This was a lot of work in the code of olsrd and Thomas did a cumbersome but
123 really great job. I had the feeling that this would be a major milestone on
124 the way to a good working protocol. It was released as olsr-0.48 - we had a
125 nice party and a big barrel of beer at the c-base to celebrate the moment :)
126 There was one tradeoff, however. We had to break compatibility with
127 RFC3626. But since RFC3626 wasn't usable in real-life we didn't bother much.
130 * Deployment of olsr-0.48 in the Freifunk-Mesh with ETX/LQ-Mechanism
133 Probably bugs in the huge amount of new program-code
135 Good routing decisions on wireless links operating at the same speed
136 as long as the network is idle
138 Throughput improved - but throughput is interrupted by routing loops
139 as soon as heavy network load is introduced
141 Payload runs for a while at high speed, then the traffic is interrupted,
142 comes back after a while at slow speed - caused by routing loops
144 Dynamic gateway selection --> Stateful connections get interrupted
145 when a different gateway is selected
148 This was a mayor improvement, but...
149 Payload causes interference and alters LQ/ETX-Values - interference causes
150 lost LQ-Messages, so LQ/ETX-Values in topology messages change when
151 traffic is introduced. If the protocol fails to update the link state
152 information in time the routing tables are not in sync - thus causing
153 routing loops.
155 Freifunk OLSR-Networks in other cities that had relatively low traffic
156 compared to the capacity of their wireless links were quite happy with 0.48.
157 But networks where links got saturated were still unstable.
159 Now it became even more clear how stupid the idea of Multipoint-Relays was.
160 Traffic causes interference and lost topology update messages. So the link
161 quality values change - and the information that tells the nodes in the mesh
162 about the topology changes are likely to get lost on their way. MPRs reduce
163 the redundancy that is desperately needed to keep routing tables in sync.
164 And - even worse - the information about who is whose MPR is another
165 information that has to be synced. Another source of failure.
167 So we had to find a way to make sure that information about topology changes
168 is updated in time to avoid routing loops. A perfect routing table that only
169 works as long as the network is idle is quite useless...
171 One viable solution in a small mesh would be to send topology control
172 messages (TC-Messages) more often than Hello's - but we already had a mesh
173 with more than 100 nodes, so the traffic caused by redundant TC-Messages
174 would suffocate the network by introducing massive overhead. Than we had the
175 idea of sending TC-Messages with different TTL (Time-To-Live) values. I had
176 the hypothesis that routing loops would occur amongst adjacent nodes - so we
177 would only have to update topology changes quickly and redundant amongst
178 adjacent nodes.
180 We had to design an algorithm that would make sure that adjacent nodes have
181 correct topology information - but the problem is that it seemingly would
182 not work without massive overhead.
184 The idea we came up with is to send TC messages to adjacent nodes very
185 often, i.e. nodes that are likely to be involved in routing loops, without
186 flooding the whole mesh with each sent TC message. We called it Link Quality
187 Fish Eye mechanism.
189 OLSR packets carry a Time To Live (TTL) that specifies the maximum number of
190 hops that the packets is allowed to travel in the mesh. The Link Quality
191 Fish Eye mechanism generates TC messages not only with the default TTL of
192 255, but with different TTLs, namely 1, 2, 3, and 255, restricting the
193 distribution of TC messages to nodes 1, 2, 3, and 255 hops away. A TC
194 message with a TTL of 1 will just travel to all one-hop neighbours, a
195 message with a TTL of 2 will in addition reach all two-hop neighbours, etc.
196 TC messages with small TTLs are sent more frequently than TC messages with
197 higher TTLs, such that immediate neighbours are more up to date with respect
198 to our links than the rest of the mesh.
200 The following sequence of TTL values is used by olsrd.
202 255 3 2 1 2 1 1 3 2 1 2 1 1
204 Hence, a TC interval of 0.5 seconds leads to the following TC
205 broadcast scheme.
207 * Out of 13 TC messages, all 13 are seen by one-hop neighbours (TTL
208 1, 2, 3, or 255), i.e. a one-hop neighbour sees a TC message every
209 0.5 seconds.
211 * Two-hop neighbours (TTL 2, 3, or 255) see 7 out of 13 TC messages,
212 i.e. about one message per 0.9 seconds.
214 * Three-hop neighbours (TTL 3 or 255) see 3 out of 13 TC messages,
215 i.e. about one message per 2.2 seconds.
217 * All other nodes in the mesh (TTL 255) see 1 out of 13 TC messages,
218 i.e. one message per 6.5 seconds.
220 The sequence of TTL values is hardcoded in lq_packet.c and can be altered
221 easily for further experiments.
223 The implementation of Link Quality Fish Eye mechanism took Thomas only a few
224 minutes - and it was the second major improvement. Thomas also introduced a
225 new switch, called LinkQualityDjikstraLimit. The slow CPUs of embedded
226 routers have serious problems to recalculate the routing tables in a Mesh
227 with more than 100 nodes. Every incoming TC-Message would trigger another
228 recalculation of the Djikstra-Table - this would be far too often.
229 LinkQualityDjikstraLimit allows to set an interval for recalculating the
232 * Deployment of olsr-0.4.10
235 Now it is really working and usable :)
237 It's still not absolutely loop-free
239 Multihop-Links with 10 Hops work and are stable as long
240 as the wireless links work.
242 LinkQualityDjikstraLimit allows to run olsr even on a
243 realtively slow CPU in a big Mesh-Cloud -
244 but the routing-table becomes very very static
246 Gateway-Switching is still a constant annoyance if a mesh
247 has more than one Internet-Gateway
250 Apart from the problems with Gateway-Switching it is now a
251 well behaving routing protocol
254 But still... Thomas and I agreed that we could cope with the increasing size
255 of the Freifunk-Networks only by making the protocol more and more static.
256 Link State Routing has significant design flaws. Why does every node
257 calculate a whole routing table from every node to every node - if all it
258 can do is decide which direct neighbor it sends the packet to? Synchronized
259 Link State Information is impossible to achieve in a wireless network. Why
260 let every CPU calculate unneccessary information? Link State Routing thinks
261 too much and is far too complex for is own good. Why do all this?
264 We decided to come up with something better. Thomas had the idea for a name:
265 B.A.T.M.A.N - Better Approach To Mobile Ad-Hoc Networking.
, as shown below in the list of files. Do
link, since this is subject to change and can break easily.