Does the Internet “End” at 500K routes?

No! Of course, the Internet does not end at 500K routes. On August 13, 2014, there was a lot of “news” about instability on the Internet that might have been caused by a surge of new Internet routes (see articles like “Internet routers hitting 512K limit, some become unreliable” – http://arstechnica.com/security/2014/08/internet-routers-hitting-512k-limit-some-become-unreliable/). The most accurate write-up can be found here:

“What caused today’s Internet hiccup” by Andree Toonk (http://www.bgpmon.net/what-caused-todays-internet-hiccup/)

Is this “instability” something to worry about? Yes! But please worry productively. What follows is a checklist recommended for any organization that is connected to the Internet with its own Autonomous System Number (ASN).

First, please understand the real problem. One service provider “de-aggregated” thousands of routes and leaked them into the global routing table (see Andree’s post). Some routers did not have enough forwarding memory to store all these additional routes and became “unpredictable.” This resulted in some networks being disconnected from the Internet. Why did this happen? Routers and switches have forwarding tables that are used to “route packets” from one interface to another. In modern routers, these forwarding tables use high-speed memory that allows for extremely fast lookups. We need these fast lookups to handle 100G interfaces and their packets-per-second forwarding rates. If this high-speed memory “overloads,” the router’s programming tries to keep forwarding the existing routes as normal but passes the new routes to the slow path (details vary between vendors). As a consequence, operators need to understand how their routers behave during these overloads.
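The arithmetic behind a de-aggregation leak can be sketched in a few lines of Python. This is only an illustration: the table size, the capacity figure, and the helper names `deaggregate` and `fits_in_fib` are assumptions made for the example, not measurements from the August 13 event or any vendor's numbers.

```python
import ipaddress

# Hypothetical numbers for illustration only.
FIB_CAPACITY = 512 * 1024   # assumed hardware limit: 524,288 IPv4 routes
CURRENT_ROUTES = 500_000    # assumed table size just before the leak


def deaggregate(prefix: str, new_length: int) -> list:
    """Split one aggregate into all of its more-specific prefixes."""
    network = ipaddress.ip_network(prefix)
    return list(network.subnets(new_prefix=new_length))


def fits_in_fib(current: int, leaked: int, capacity: int = FIB_CAPACITY) -> bool:
    """True when the table still fits in fast forwarding memory after a leak."""
    return current + leaked <= capacity


# A single /19 turns into 2 ** (24 - 19) = 32 separate /24 routes.
leaked_from_one_aggregate = len(deaggregate("10.0.0.0/19", 24))

# If 800 such aggregates leak at once, the assumed table no longer fits.
leaked_total = 800 * leaked_from_one_aggregate
still_fits = fits_in_fib(CURRENT_ROUTES, leaked_total)
```

With these assumed numbers, 25,600 extra routes are enough to push a 500,000-route table past a 524,288-entry limit, which is why a table that looks comfortable on a normal day can overflow within minutes.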

Understand the key points:

De-aggregating “route leaks” will happen. While they are not normal, they will happen. Any ASN (network) that is connected to the Internet should prepare for route leaks.
The Internet is not coming to an end. In fact, the growth of the Internet route table is not forecast to be a major concern over the next five years. Please download and watch Geoff Huston’s NANOG 60 talk “BGP in 2013” (https://www.nanog.org/meetings/abstract?id=2270). Geoff walks through an easy-to-understand analysis of the global Internet route table’s growth.
Do worry about malicious route leaks! There is little to prevent someone from de-aggregating routes and injecting them into the Internet. Anyone connecting to the Internet must have this contingency as part of their routing policy.

This last point is the critical item. What can you do about it? Start with this checklist (or the conversation you need to have with your network engineer):

✓ Have you documented your router’s configuration? You would be surprised how many organizations have never saved a copy of their router’s configuration. Some will “screen scrape” the configuration and save it. Others will use tools like RANCID to maintain an up-to-date copy. Still others will have tools that build the configuration offline and push the full configuration to the router. The key is to have an offline copy of the configuration. It is obvious, but half the operators that engage in “BGP consulting” cannot provide a current offline copy of their configuration (they need to log in and get a copy).
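Once an offline copy exists, checking it against what the router is actually running is a small step. The sketch below uses Python's standard `difflib` for that comparison. The function name `config_drift` and the sample configuration lines are hypothetical, and fetching the running configuration from the device is left out entirely.

```python
import difflib


def config_drift(saved: str, running: str) -> list:
    """Return unified-diff lines between the offline copy and the running config."""
    return list(
        difflib.unified_diff(
            saved.splitlines(),
            running.splitlines(),
            fromfile="offline-copy",
            tofile="running-config",
            lineterm="",
        )
    )


# Made-up configuration text using documentation addresses and ASNs.
saved_config = "router bgp 64500\n neighbor 192.0.2.1 remote-as 64501\n"
running_config = saved_config + " neighbor 192.0.2.1 maximum-prefix 384000\n"

drift = config_drift(saved_config, running_config)
```

An empty result means the offline copy is current. Anything else is a change that was made on the router and never written down.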

✓ Write down the inbound and outbound routing policy in plain English so that anyone in the company can understand it. Gateway routers that connect to the global Internet have two policies. The first is the set of rules for accepting routes from the Internet (inbound). These routes govern the packets you send to the Internet. The second is the set of routes you send to the Internet (outbound). These govern how the Internet gets to your network. Most routing policy mistakes have their root cause in the way the policy is expressed. Too many network engineers just write the BGP configuration without writing an overall policy. Writing the policy down before you configure a router is similar to flow charting before programming, or to writing the “test” before the code in TDD (Test Driven Development). Here is one example that uses the Routing Resilience Manifesto guidelines as a foundation for a multi-homed organization (two Internet connections):

Inbound Internet Route Policy (Example)

Only accept routes that are no more specific than the minimum practical allocation set by each Regional Internet Registry (RIR). We will filter all more specific routes. For example, the /24s inside the /19 will be filtered. Our two upstream providers will have the more specific routes. We just need the core aggregate route.
Drop all Documented Special Use Addresses (DSUA). We should never see 0.0.0.0 or 127.0.0.0 come to our network, but we need to filter to protect against malicious intent.
Set the Max Prefix Limit to alarm at 25% lower than the max number of prefixes that can be processed on our routers. If there is a prefix-leak on the Internet, we need to have an alarm to let us know what is happening. The SNMP trap from the BGP feature should go to the NOC and trigger an immediate escalation.
Consequences & Risk of “inaction”: Too many prefixes can overload the gateway router and cause network instability.
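The inbound rules above can be expressed as a small Python model, which is sometimes a useful halfway point between the plain-English policy and a router configuration. Everything specific in this sketch is an assumption for illustration: it is IPv4 only, the special-use list is partial, a single maximum prefix length (`MAX_ACCEPTED_LENGTH`) stands in for the per-RIR minimum allocation sizes, and `ROUTER_MAX_PREFIXES` is a made-up capacity figure.

```python
import ipaddress

# Assumptions for illustration only (see the paragraph above).
SPECIAL_USE = [
    ipaddress.ip_network(p)
    for p in (
        "0.0.0.0/8", "10.0.0.0/8", "100.64.0.0/10", "127.0.0.0/8",
        "169.254.0.0/16", "172.16.0.0/12", "192.0.2.0/24",
        "192.168.0.0/16", "198.18.0.0/15", "198.51.100.0/24",
        "203.0.113.0/24", "224.0.0.0/4", "240.0.0.0/4",
    )
]
MAX_ACCEPTED_LENGTH = 22       # hypothetical stand-in for the RIR minimum allocation
ROUTER_MAX_PREFIXES = 512_000  # hypothetical capacity figure from the vendor


def accept_inbound(prefix: str) -> bool:
    """Apply the first two inbound rules to one received prefix."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > MAX_ACCEPTED_LENGTH:
        return False  # more specific than the minimum allocation: filter it
    # Reject anything that covers or sits inside a special-use block.
    return not any(net.overlaps(special) for special in SPECIAL_USE)


def alarm_threshold(max_prefixes: int = ROUTER_MAX_PREFIXES) -> int:
    """Prefix count at which to alarm: 25% below what the router can process."""
    return max_prefixes * 75 // 100


def should_alarm(received_prefixes: int) -> bool:
    """True once the received prefix count reaches the alarm threshold."""
    return received_prefixes >= alarm_threshold()
```

With these assumed values, `accept_inbound("11.0.0.0/19")` passes, `accept_inbound("11.0.1.0/24")` is filtered as a more specific, and the alarm fires at 384,000 prefixes, well before the assumed 512,000 limit.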

Outbound Internet Route Policy (Example)

Only advertise our prefixes to each of our upstream providers. Tag our advertisements with a BGP community.
Set an outbound prefix filter that explicitly permits only our prefixes. All other prefixes will be denied with a “deny all” and a log set on the deny. This will be used to spot issues with our outbound policy.
Set an outbound BGP community filter that only allows prefixes with the designated BGP community to be passed to our upstream providers. This is a “safeguard” filter in case the prefix filter is broken.
Set a Documented Special Use Addresses (DSUA) filter to ensure our network does not cause a problem by advertising special use prefixes outbound. It would be really bad to advertise “default” to the Internet.
Our outbound prefix list should contain only the aggregate. Any more specific prefixes should never be longer than /24 (IPv4).
Consequences & Risk of “inaction”: Leaking routes to the Internet will cause unwanted traffic to be “pulled” into our network. This will cause a “self-inflicted DDoS.”
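The layered outbound filters above can be modelled the same way. In this sketch the prefix, the community value `64500:100`, and the names `Route` and `permit_outbound` are all hypothetical (documentation address space and a documentation ASN), and the DSUA filter is omitted to keep the example short.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical values: a documentation prefix and a documentation ASN.
OUR_PREFIXES = {ipaddress.ip_network("203.0.113.0/24")}
OUR_COMMUNITY = "64500:100"


@dataclass(frozen=True)
class Route:
    prefix: str
    communities: frozenset = frozenset()


def permit_outbound(route: Route, deny_log: list) -> bool:
    """Permit a route only if it passes the prefix filter AND the community filter."""
    net = ipaddress.ip_network(route.prefix)
    if net not in OUR_PREFIXES:
        # The "deny all" with logging, used to spot outbound policy problems.
        deny_log.append(f"deny {net}: not one of our prefixes")
        return False
    if OUR_COMMUNITY not in route.communities:
        # The safeguard filter in case the prefix filter is ever broken.
        deny_log.append(f"deny {net}: missing community {OUR_COMMUNITY}")
        return False
    return True


deny_log = []
tagged = frozenset({OUR_COMMUNITY})
ours = permit_outbound(Route("203.0.113.0/24", tagged), deny_log)
leaked_default = permit_outbound(Route("0.0.0.0/0", tagged), deny_log)
untagged = permit_outbound(Route("203.0.113.0/24"), deny_log)
```

Only the tagged copy of our own prefix is advertised. The default route and the untagged prefix are both dropped, and each leaves an entry in `deny_log` for the NOC to review.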

This example routing policy can be turned into slides and explained to management, used with a vendor to create specific configurations, or used for team consultation on changes to the route policy. The key is to have something that many people can read, address, and consult. IOS or JUNOS configurations are not the type of “route policy” that facilitates consultation.

✓ Do you really need the “full Internet routing table”? When asked, most “multi-homed” enterprise networks cannot give a coherent, explicit reason why they need full Internet routes on their gateway router. Most can live with partial routes, or with routes filtered so that the more specific routes are not accepted. Edge enterprise networks can save money (no upgrade of forwarding table memory) and reduce risk (less chance of being hit with a prefix explosion attack).

✓ Get the empirical data from your router vendor: how many routes will the “chips” hold? The vendors need to supply empirical test results on the number of routes their equipment can process. This needs to be engineering data. Expect the vendors to comply, at a minimum, with the guidelines set forward by the IETF’s Benchmarking Methodology Working Group (bmwg) (see http://datatracker.ietf.org/doc/draft-ietf-bmwg-bgp-basic-convergence/). The number of routes that can be safely processed in the router’s forwarding table will determine how the router is configured, where it is used, and when it will need to be upgraded or replaced.

Do not be distracted into thinking this issue is a “Cisco” problem. The problem is that network engineers are not demanding the details from their vendors that they need for accurate dimensioning and for correct forecasting of when their routers will need action (configuration, upgrade, or replacement).
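Once the vendor supplies a tested capacity figure, the forecasting itself is simple arithmetic. The sketch below assumes linear growth and a 25% safety margin. The function names and every number passed to them are hypothetical, and real table growth is not guaranteed to be linear.

```python
import math


def usable_routes(capacity: int, margin_pct: int = 25) -> int:
    """Routes we are willing to carry: the tested capacity minus a safety margin."""
    return capacity * (100 - margin_pct) // 100


def months_until_action(current: int, capacity: int,
                        growth_per_month: int, margin_pct: int = 25) -> int:
    """Months until the table reaches the usable limit, assuming linear growth."""
    if growth_per_month <= 0:
        raise ValueError("growth_per_month must be positive")
    remaining = usable_routes(capacity, margin_pct) - current
    if remaining <= 0:
        return 0  # already past the safe limit: action is needed now
    return math.ceil(remaining / growth_per_month)
```

For example, with an assumed 1,000,000-route capacity, 500,000 routes today, and growth of 4,000 routes per month, `months_until_action(500_000, 1_000_000, 4_000)` returns 63. The same table on a router tested to only 524,288 routes returns 0, meaning the router is already inside its safety margin.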

✓ Know your peers. “Do you have the phone numbers and e-mail addresses of your upstream providers?” is one of the first questions I ask of any dual-homed enterprise. The majority answer “no.” This contact information needs to be on your phone, in your NOC, and tested at least once a quarter (contacts change). If you are connected to an Internet Exchange Point (IXP), then you need the contact information for everyone you peer with, plus the IXP operator. Accurate contact information matters in the reverse direction as well: all these peers need your contact information. The community of engineers who maintain global connectivity will look after each other. They will call each other. But they need the numbers to call. Don’t wait for something to happen. Proactively get this information. The BGP instability issue on August 13, 2014 was primarily a non-issue for those networks that had the contact information in their address book.

✓ Sign up to the “BGP Reports.” The only way to really know what is going on with your BGP interconnectivity is to see your network from the inside and outside. Outside means using tools that monitor your network. These could range from commercial tools to academic projects. Start with these tools:

CIDR Report – http://www.cidr-report.org/as2.0/. View how well you are aggregating.
Hurricane Electric’s BGP Toolkit – http://bgp.he.net/. Excellent tool to explore how the world sees your BGP advertisements.
BGP Mon – http://www.bgpmon.net/. Real-time monitor that is free for the first few prefixes. This is perfect for the average multi-homed enterprise.

There are other tools, but these basic ones get people started on the right path.
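A rough version of the aggregation check that the CIDR Report performs can be run locally against your own list of advertised prefixes. The sketch below uses `ipaddress.collapse_addresses` from the Python standard library. The function name `aggregation_report` and the sample prefixes (documentation ranges) are assumptions, and the check ignores origin AS and any deliberate traffic engineering behind a more specific advertisement.

```python
import ipaddress


def aggregation_report(advertised: list) -> tuple:
    """Compare the advertised prefix count with the fully aggregated count."""
    networks = [ipaddress.ip_network(p) for p in advertised]
    aggregated = list(ipaddress.collapse_addresses(networks))
    return len(networks), len(aggregated), aggregated


# Hypothetical advertisements using documentation address space.
advertised_count, aggregated_count, aggregates = aggregation_report(
    ["198.51.100.0/25", "198.51.100.128/25", "203.0.113.0/24"]
)
```

Here three advertisements collapse to two, because the two /25s are the halves of one /24. A large gap between the two counts is a hint that the outbound policy is sending more routes than it needs to.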

The key objective is to ensure the network operations team is looking at the data on the global Internet routing table, at how the organization impacts that table, and at whether there are things that can be done to protect the organization’s interest. Note that the “Internet’s well being” is in every organization’s interest.

✓ Sign up to the appropriate Network Operators Group (NOG). The network engineers in your organization should be on the appropriate network operators groups. These groups are the first places people will bring up instability issues and problems that are impacting everyone. They are regionalized, with various levels of participation. Look through the master list maintained by the North American Network Operators’ Group (NANOG) – https://www.nanog.org/resources/orgs. Sign up and set up a mail filter. Check the mailing list every day, or several times a day. If there is an instability problem with your Internet connection, check the NOG list to see if anything is going on with the Internet’s stability.

What if you do not have a local NOG? E-mail bgreene@senki.org for help. We just started IDNOG (http://www.idnog.or.id/). The team was persistent and found there were plenty of people and organizations who would help.

Summary. No, the Internet is not in trouble (see Geoff Huston’s talk). What this incident should teach all network engineers is that they cannot take the routers that connect them to the Internet for granted. If you are connected to the Internet through BGP, then due diligence, monitoring, and good policy are needed to maintain a healthy connection to the Internet.
