attachment:olsr-story.txt di olsr

Allegato "olsr-story.txt"

   1 The OLSR.ORG story
   2 
   3 Proactive protocols (Link State Routing Protocols) generate a lot of
   4 overhead because they have to keep topoloy information and routing tables in
   5 sync amongst all or at least amongst adjacent nodes. If the protocol does
   6 not manage to keep the routing tables synced it is likely that the payload
   7 will spin in routing loops until the TimeToLive (TTL) is expired. Apart from
   8 high traffic-overhead and CPU-Load this is the biggest issue for Link State
   9 Routing Protocols.
  10 
  11 We were actively involved in the evolution of olsrd from olsr.org. Actually
  12 we were the people that made it functional. RFC3626 - the initial IETF-draft
  13 of olsr - does not work in real life. If you want to find out what it's
  14 developers intended it to be and how it should work, I would like to suggest
  15 reading the RFC3626 after you have seen the presentation of Andreas
  16 Tøennesen on the OLSR.ORG website about RFC3626.
  17 
  18 
  19 We heavily modified olsr over the time. We disabled almost everything that
  20 the inital designers of olsr thought was smart and replaced it with the
  21 LQ/ETX-Mechanism and Fish-Eye Mechanism tp update topology information.
  22 
  23 
  24 
  25 What we did to improve olsr (in historical order):
  26 
  27 * Test OLSR according to RFC3626 at the conference Wizards of OS III in 2004
  28   - Meshcloud with 25 Nodes
  29 
  30 	Results:
  31 
  32 	Routing tables take long time to build and no time to break down.
  33 	Routes flap.
  34 	Routing loops.
  35 	No throughput.
  36 	Gateway switches all the time - so stateful applications will brake 
  37 	down all the time
  38 
  39 
  40 
  41 	Conclusion:
  42 
  43 	Hysteresis mechanism frequently kicks Multi Point Relays (MPRs) out 
  44 	of the routing table --> Infrastructure to broadcast topology information 
  45 	breaks down all the time and MPRs have to be negotiated again...
  46 
  47 	Multipoint relay selection selects nodes far away to keep the number of 
  48 	necessary Multi Point Relays low --> Links to MPRs are weak, so hysteresis 
  49 	kicks them out of the routing table more often than not
  50 	
  51 	Multipoint relay selection reduces protocol overhead and prevents topology 
  52 	information from being in sync --> Routing loops
  53 	
  54 	Routes are unstable --> No throughput
  55 	
  56 	Routes selected on minimum Hop-count maximises packetloss --> No throughput
  57 
  58 	Routing loops --> No throughput
  59 
  60 	Dynamic gateway selection --> Stateful connections get interrupted when a 
  61 	different gateway is selected
  62 
  63 
  64 
  65 	What we did:
  66 	Disable hysteresis.
  67 	Disable MPRs - all nodes forward topology information.
  68 
  69 
  70 
  71 
  72 Now almost everything that was meant to optimize Link State Routing was
  73 disabled - a simple proactive link-state routing protocol with support for
  74 multiple interfaces was all that was left. We started to deploy OLSR in the
  75 Freifunk Mesh in Berlin - rather we should have named it LSR back then. But
  76 since the implementation came from olsr.org and everything could be switched
  77 on and off by the configuration file we didn't think about starting a new
  78 project and renaming it. This became later a source of confusion and
  79 disappointment for all people that tried olsr.org and had no idea what was
  80 going on in Berlin. If you use the standard configuration file that is
  81 shipped with olsr.org, olsrd will still behave according to RFC3626. So if
  82 you want to see how miserable RFC3626 works - try it with the default
  83 configuration file.
  84 
  85 
  86 
  87 
  88 * Deployment of OLSR (with 'Optimizations' removed) in the Berlin Freifunk
  89   mesh cloud - 2004 
  90 
  91 	Results:
  92 	Works much better than RFC3626. Still it was hardly usable.
  93 	Throughput very low and unstable.
  94 	Routing table doesn't break down anymore
  95 	Dynamic gateway selection --> Stateful connections get interrupted 
  96 	when a different gateway is selected
  97 	
  98 
  99 	Conclusion:
 100 	We knew routes based on minimum hopcount will likely have very 
 101 	low throughput. 
 102 	Dynamic gateway selection is a tradeoff of automatic gateway 
 103 	selection by the protocol
 104 
 105 
 106 	
 107 I knew from my first experience with mobilemesh (another Link State Routing
 108 Protocol that we tried at the beginning of the Freifunk Mesh) that minimum
 109 hop count sucks completely as an algorithm to select routes. So I started to
 110 think about routes that are chosen based on metrics measuring the
 111 quality/throughput of links. I decided to use a metric based on packet loss
 112 and found the idea of ETX (Expected Transmission Count) in a paper written
 113 at the MIT. I didn't like the way they suggested to implement it (sending
 114 extra UDP packets), so I developed the idea of ETX/LQ together with Thomas
 115 Lopatic who implemented the new ideas in olsrd. Rather than sending extra
 116 UDP-packets we could just keep track of missed or successfully transmitted
 117 Hello-Packets which are frequently broadcasted by the LSR-mechanism anyway.
 118 And we could send the information about the successfully received packets
 119 within the Hello messages - so neighbors are updated immediately with every
 120 "Hello" broadcast how their neighbor thinks about their own transmissions.
 121 
 122 This was a lot of work in the code of olsrd and Thomas did a cumbersome but
 123 really great job. I had the feeling that this would be a major milestone on
 124 the way to a good working protocol. It was released as olsr-0.48 - we had a
 125 nice party and a big barrel of beer at the c-base to celebrate the moment :) 
 126 There was one tradeoff, however. We had to break compatibility with
 127 RFC3626. But since RFC3626 wasn't usable in real-life we didn't bother much.
 128 
 129 
 130 * Deployment of olsr-0.48 in the Freifunk-Mesh with ETX/LQ-Mechanism
 131 
 132 	Results:
 133 	Probably bugs in the huge amount of new program-code
 134 	
 135 	Good routing decisions on wireless links operating at the same speed 
 136 	as long as the network is idle
 137 	
 138 	Throughput improved - but throughput is interrupted by routing loops 
 139 	as soon as heavy network load is introduced
 140 	
 141 	Payload runs for a while at high speed, then the traffic is interrupted, 
 142 	comes back after a while at slow speed - caused by routing loops
 143 	
 144 	Dynamic gateway selection --> Stateful connections get interrupted 
 145 	when a different gateway is selected
 146 
 147 	Conclusion:
 148 	This was a mayor improvement, but... 
 149 	Payload causes interference and alters LQ/ETX-Values - interference causes 
 150 	lost LQ-Messages, so LQ/ETX-Values in topology messages change when
 151 	traffic is introduced. If the protocol fails to update the link state 
 152 	information in time the routing tables are not in sync - thus causing
 153 	 routing loops.
 154 
 155 Freifunk OLSR-Networks in other cities that had relatively low traffic
 156 compared to the capacity of their wireless links were quite happy with 0.48.
 157 But networks where links got saturated were still unstable.
 158 
 159 Now it became even more clear how stupid the idea of Multipoint-Relays was.
 160 Traffic causes interference and lost topology update messages. So the link
 161 quality values change - and the information that tells the nodes in the mesh
 162 about the topology changes are likely to get lost on their way. MPRs reduce
 163 the redundancy that is desperately needed to keep routing tables in sync.
 164 And - even worse - the information about who is whose MPR is another
 165 information that has to be synced. Another source of failure.
 166 
 167 So we had to find a way to make sure that information about topology changes
 168 is updated in time to avoid routing loops. A perfect routing table that only
 169 works as long as the network is idle is quite useless...
 170 
 171 One viable solution in a small mesh would be to send topology control
 172 messages (TC-Messages) more often than Hello's - but we already had a mesh
 173 with more than 100 nodes, so the traffic caused by redundant TC-Messages
 174 would suffocate the network by introducing massive overhead. Than we had the
 175 idea of sending TC-Messages with different TTL (Time-To-Live) values. I had
 176 the hypothesis that routing loops would occur amongst adjacent nodes - so we
 177 would only have to update topology changes quickly and redundant amongst
 178 adjacent nodes.
 179 
 180 We had to design an algorithm that would make sure that adjacent nodes have
 181 correct topology information - but the problem is that it seemingly would
 182 not work without massive overhead.
 183 
 184 The idea we came up with is to send TC messages to adjacent nodes very
 185 often, i.e. nodes that are likely to be involved in routing loops, without
 186 flooding the whole mesh with each sent TC message. We called it Link Quality
 187 Fish Eye mechanism.
 188 
 189 OLSR packets carry a Time To Live (TTL) that specifies the maximum number of
 190 hops that the packets is allowed to travel in the mesh. The Link Quality
 191 Fish Eye mechanism generates TC messages not only with the default TTL of
 192 255, but with different TTLs, namely 1, 2, 3, and 255, restricting the
 193 distribution of TC messages to nodes 1, 2, 3, and 255 hops away. A TC
 194 message with a TTL of 1 will just travel to all one-hop neighbours, a
 195 message with a TTL of 2 will in addition reach all two-hop neighbours, etc.
 196 TC messages with small TTLs are sent more frequently than TC messages with
 197 higher TTLs, such that immediate neighbours are more up to date with respect
 198 to our links than the rest of the mesh.
 199 
 200 The following sequence of TTL values is used by olsrd.
 201 
 202         255 3 2 1 2 1 1 3 2 1 2 1 1
 203 
 204 Hence, a TC interval of 0.5 seconds leads to the following TC
 205 broadcast scheme.
 206 
 207   * Out of 13 TC messages, all 13 are seen by one-hop neighbours (TTL
 208     1, 2, 3, or 255), i.e. a one-hop neighbour sees a TC message every
 209     0.5 seconds.
 210 
 211   * Two-hop neighbours (TTL 2, 3, or 255) see 7 out of 13 TC messages,
 212     i.e. about one message per 0.9 seconds.
 213 
 214   * Three-hop neighbours (TTL 3 or 255) see 3 out of 13 TC messages,
 215     i.e. about one message per 2.2 seconds.
 216 
 217   * All other nodes in the mesh (TTL 255) see 1 out of 13 TC messages,
 218     i.e. one message per 6.5 seconds.
 219 
 220 The sequence of TTL values is hardcoded in lq_packet.c and can be altered
 221 easily for further experiments.
 222 
 223 The implementation of Link Quality Fish Eye mechanism took Thomas only a few
 224 minutes - and it was the second major improvement. Thomas also introduced a
 225 new switch, called LinkQualityDjikstraLimit. The slow CPUs of embedded
 226 routers have serious problems to recalculate the routing tables in a Mesh
 227 with more than 100 nodes. Every incoming TC-Message would trigger another
 228 recalculation of the Djikstra-Table - this would be far too often.
 229 LinkQualityDjikstraLimit allows to set an interval for recalculating the
 230 Djikstra-Table.
 231 
 232 * Deployment of olsr-0.4.10 
 233 
 234 		Results:
 235 		Now it is really working and usable :)
 236 		
 237 		It's still not absolutely loop-free
 238 		
 239 		Multihop-Links with 10 Hops work and are stable as long 
 240 		as the wireless links work.
 241 		
 242 		LinkQualityDjikstraLimit allows to run olsr even on a 
 243 		realtively slow CPU in a big Mesh-Cloud -
 244 		but the routing-table becomes very very static
 245 		
 246 		Gateway-Switching is still a constant annoyance if a mesh 
 247 		has more than one Internet-Gateway
 248 
 249 		Conclusions:
 250 		Apart from the problems with Gateway-Switching it is now a 
 251 		well behaving routing protocol
 252 
 253 
 254 But still... Thomas and I agreed that we could cope with the increasing size
 255 of the Freifunk-Networks only by making the protocol more and more static.
 256 Link State Routing has significant design flaws. Why does every node
 257 calculate a whole routing table from every node to every node - if all it
 258 can do is decide which direct neighbor it sends the packet to? Synchronized
 259 Link State Information is impossible to achieve in a wireless network. Why
 260 let every CPU calculate unneccessary information? Link State Routing thinks
 261 too much and is far too complex for is own good. Why do all this?
 262 
 263 
 264 We decided to come up with something better. Thomas had the idea for a name:
 265 B.A.T.M.A.N - Better Approach To Mobile Ad-Hoc Networking.
Nuovo allegato

Allegati

Per riferirsi agli allegati di una pagina, usare attachment:NOME_FILE, come mostrato qui sotto nell'elenco degli allegati. NON usare l'URL che si trova in corrispondenza del collegamento [scarica], potrebbe cambiare in futuro.
[scarica | mostra] (2009-02-05 12:51:07, 11.2 KB) [[attachment:olsr-story.txt]]