MAT Working Group
10 May 2017
2 p.m.:
CHAIR: Good afternoon. Welcome to the MAT Working Group. I think the last people are finding their seats, or the other room because they want to talk about IPv6 instead. Anybody else who wants to talk about IPv6, it's in the other room. This one is MAT. We have a busy agenda today. We are going to start with a little welcome and the little administrative things about the Working Group. I believe that the minutes from the last Working Group were posted on the mailing list a while ago and you can find them on the website as well. For today, we have a scribe, Suzanne, and we have Michl lay monitoring the Jabber channel, so people can ask questions on Jabber, and I will be chairing. Remember, when you have questions and want to go to the microphone, state your name and your affiliation, ask one question at a time and then let the next person in line get to the microphone. Not the telephone, we are not talking about that.
And then, is there anybody here in the room who has any comments on the minutes from the last meeting? This is the time to bring it up. No. Awesome. Finally, we have the agenda. We are going to start with Fast.com and then talk a little bit about v6 health metrics, after that come RIPE Atlas news, TraceMon and news from a Russian hackathon. Any comments on the agenda? Now is the time. No. Awesome. Very good. Then we are going to start out with Sergey, who is going to talk about Fast.com.
SERGEY FEDOROV: Hello everyone. Today I am going to talk about Fast.com, the Internet speed test that Netflix launched about a year ago, but first let me introduce myself. I am new to RIPE, and I am a software engineer working for the Open Connect group at Netflix, which is the CDN that powers all of Netflix's video traffic. I have been there for four‑and‑a‑half years working on platform and infrastructure. I built the monitoring system for Open Connect, I worked on a number of QoE projects, I am the main engineer behind Fast.com, and currently I am an engineer on a team that does the acceleration of dynamic requests that power Netflix.
In my spare time I brew beer and like to drink it. I am a big fan of the great American barbecue and, going back to my roots, I am originally from Russia and do play ice hockey. I combine those activities quite regularly on the same day.
So before I go into Fast.com, let me draw a distinction between the two tools in the Internet measurement space that Netflix has right now. Many of you know about the ISP Speed Index, which is the monthly report that ranks ISPs based on the streaming quality of Netflix customers. The number that goes into the index is a bit rate based on Netflix traffic. The number that you see on Fast.com has nothing to do with the ISP Speed Index. Fast.com is a generic Internet speed measurement tool, it's available for Netflix users and non‑users, and the only thing in common is that it's powered by Netflix infrastructure.
So what were our goals in building a speed test? We wanted to have something that has a simple UI that can be understood by regular users who may not have a good idea of how the Internet works; we wanted it to be very lightweight, fast and reliable, with wide adoption across platforms; and we wanted to have a test that would represent real conditions for people consuming content from the Internet. Why do we think that Netflix is well positioned to do that? Well, first, we have quite a lot of experience in the field, and it happens that Netflix traffic contributes up to 35% of downstream traffic in North America, and now we are a global company and our traffic is usually in the top traffic levels in all countries. And we have the global infrastructure to serve Netflix content. More about that: the main component of that infrastructure related to Fast.com is our CDN, called Open Connect. Open Connect consists of thousands of servers that contain a bunch of hardware and are optimised to serve an enormous amount of traffic for Netflix videos. We deploy those servers all around the world, we have multiple locations on pretty much every continent except Antarctica, and we either install those boxes in IX locations and manage those servers ourselves, or we ship those servers for free to ISPs that have enough Netflix traffic; they plug them in and they start serving Netflix bits.
On the Cloud side we have a service that steers the client to Open Connect appliances. This service takes into consideration the locations of the client and servers, the current network conditions and the availability of content, with the goal of choosing the optimal path for traffic to go from the server to the client. And this is the team behind Open Connect: around 100 members, across different organisations like logistics, business, engineering, network, etc., and we have 70 members at RIPE today. For more Open Connect information, please find any of us and we will talk more. Now I will go to Fast.com.
So this is how it looks. We made a decision to simplify the UX as much as possible: we decided to show a single number, which is download speed. Despite the fact that many people in the room might argue that one number is not enough, think from the side of someone who does not know about the Internet, multiple locations, latency, upload and download; they reduce their expectations of the network to one number. Download speed is what we want to show. We have made the UX scalable and very easy to use, and given that it's HTML and JavaScript it is supported on all current browsers and also on browsers way beyond their official support date.
And we have anecdotal evidence of Fast.com tests from cameras and smart watches; I am looking forward to seeing a test from a microwave. So while we have a very simple UI and a single number, there is a bit of logic that happens behind it. First, we need to get the static assets for the test. It's lightweight, everything including the graphics is just 25 kilobytes, and we deployed it on the CDN. Once we load it, the first request always goes to the steering service in the Cloud; we look at the locations of the client and servers, try to find the best path for the traffic and return the locations of the servers. We use IPv6 whenever it's possible, it's pretty much based on the client configuration. Every Open Connect appliance has a 25 megabyte test file; we can generate a URL to that file and we can download either the whole file or any portion of that file given the start and end of the range. The URLs are signed, and we include the expiration in their hash, so that the URLs are unique and cannot be reused beyond the initial TTL.
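The talk does not say how the test URLs are protected beyond their carrying an expiry and not being reusable past a TTL. One common way to get that property is an HMAC over the URL, the byte range and the expiry time. The sketch below is only an illustration of that general idea: the secret, the parameter names (`start`, `end`, `expires`, `sig`) and the host name are all hypothetical and are not taken from Fast.com.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical shared secret; a real deployment would manage keys properly.
SECRET = b"demo-secret"


def _signature(base_url, start, end, expires):
    """HMAC-SHA256 over the fields that must not be tampered with."""
    payload = f"{base_url}|{start}|{end}|{expires}".encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def sign_url(base_url, start, end, ttl_seconds, now=None):
    """Return a URL for bytes [start, end] that is valid for ttl_seconds."""
    now = int(time.time()) if now is None else int(now)
    expires = now + ttl_seconds
    query = urlencode({
        "start": start,
        "end": end,
        "expires": expires,
        "sig": _signature(base_url, start, end, expires),
    })
    return f"{base_url}?{query}"


def verify_url(url, now=None):
    """Accept the URL only if the signature matches and it has not expired."""
    now = int(time.time()) if now is None else int(now)
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    try:
        start = params["start"][0]
        end = params["end"][0]
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(base_url, start, end, expires)
    return hmac.compare_digest(sig, expected) and now <= expires
```

Because the expiry is part of the signed payload, a client cannot extend the lifetime of a URL or change the requested range without invalidating the signature.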
Then, once we have the URLs, we can start the test. Let me dig a bit more into the structure of the test client. We have built it in a modular fashion. The central component is the test engine itself: it orchestrates the test, opens connections, collects the data and aggregates it, and it includes the logic to combine multiple probes into a single value and the component that decides whether we want to stop the test or keep measuring. On a lower level we have pluggable connectors which allow us to use different types of interface for different devices. In browsers that interface would be XHR or WebSockets; we have a way to plug in native connections on mobile applications; we can implement the same test logic with the request model from ‑‑ and we have integrated the measuring functionality for smart TVs and game consoles; those devices happen to have a JavaScript execution engine that allows us to integrate the measurements into the network diagnostic screen within Netflix applications. The test engine publishes a number of events on an event bus, and the UI, the metric collection component or an outside application can listen to those events and respond. The events could be information about the test being started, stopped or paused, low level connection probes, or information about intermediate results.
So for the measurement we start with a single connection and, once we have probed the network, we decide whether we want to add more. Currently we use a half megabit per second threshold to open the second connection and one megabit per second to open the third one. All the thresholds and the number of connections are something we keep playing with, because there are several cases where the results differ based on the number of connections and how you approach this.
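The connection ramp described here can be sketched as a small threshold function. This is a minimal illustration, not Netflix's code: it reads the two thresholds as 0.5 and 1 megabit per second (the transcript is unclear about the units), and the function name and the cap of three connections are assumptions.

```python
def connections_for(throughput_mbps, second_at=0.5, third_at=1.0):
    """Number of parallel connections to use for an observed throughput.

    Thresholds are in megabits per second and are assumptions based on a
    reading of the talk; the speaker notes they are still being tuned.
    """
    if throughput_mbps >= third_at:
        return 3
    if throughput_mbps >= second_at:
        return 2
    return 1
```

On a slow link the test stays on one connection and avoids competing with itself; on a fast link it reaches three connections quickly.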
The main motivation, in our case, is to avoid using too many connections, so that on slow or constrained networks they would not compete for the traffic, but on faster connections we want to saturate the network faster and reduce the time of the test. Once we have multiple probes for each connection we want to aggregate those, and there are a few things we have to decide. In our case we decided to only measure the goodput; we don't include the protocol overhead or retransmits in the number. The reason is that it's quite complex to do on the client side: depending on the type of network you run on, the protocol overhead will be different, and in our opinion we cannot use a single multiplier to adjust for that. We also exclude the TCP ramp‑up because we don't think it's a good indication of the connectivity. As for the duration of the TCP ramp‑up, currently we are looking at the moving average of the last five aggregate measurements from all connections; when this stops increasing we decide we are past the TCP ramp‑up, and everything after that point is included in the final result. Past the TCP ramp‑up, we include every single measurement in the final result; we don't exclude any of the bottom or top results. We are looking at the domain and how this hardware changes. Again, we decided not to have a static duration for the test; currently it can run from 7 seconds to dozens of seconds, based on the type of network. We are currently looking at the last seven values of the average, and we expect those last seven to be within 2% of each other; if the maximum difference is above 2% we keep testing. With that approach, if the network is very stable, we will get to a stable number pretty quickly and the test will finish in a few seconds, but if we have some loss or instability there are going to be ups and downs which will compensate each other and average out to a single measurement.
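The two rules in this paragraph, ramp‑up detection with a five‑sample moving average and stopping once the last seven averages agree within 2%, can be sketched as follows. This is one possible reading under stated assumptions, not the Fast.com implementation: the function names are hypothetical, "stops increasing" is taken to mean the moving average is no longer strictly greater than the previous one, and the 2% spread is measured relative to the largest of the seven values.

```python
from statistics import mean


def ramp_up_end(samples, window=5):
    """Index of the first sample counted towards the result.

    Slides a `window`-sample moving average over the aggregate throughput
    samples and returns the index of the last sample in the first window
    whose average did not increase. Returns None while still ramping up.
    """
    previous = None
    for end in range(window, len(samples) + 1):
        average = mean(samples[end - window:end])
        if previous is not None and average <= previous:
            return end - 1
        previous = average
    return None


def is_stable(averages, count=7, tolerance=0.02):
    """True when the last `count` running averages are within `tolerance`."""
    if len(averages) < count:
        return False
    recent = averages[-count:]
    highest, lowest = max(recent), min(recent)
    return highest > 0 and (highest - lowest) / highest <= tolerance


def run_test(samples):
    """Replay samples in order; return (result, samples_consumed).

    The result is the mean of every post-ramp-up sample, with no trimming
    of top or bottom values. Returns (None, len(samples)) if the stop
    condition is never reached.
    """
    averages = []
    for consumed in range(1, len(samples) + 1):
        seen = samples[:consumed]
        start = ramp_up_end(seen)
        if start is None:
            continue
        averages.append(mean(seen[start:]))
        if is_stable(averages):
            return averages[-1], consumed
    return None, len(samples)
```

A steady link converges after only a few post‑ramp‑up samples, while a noisy one keeps the loop running until the ups and downs average out, which matches the variable test duration the speaker describes.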
So that is in a nutshell how it's done. I omitted some other details for the sake of time, I will be happy to discuss those after the talk in the hallway. As you can see this has nothing to do with streaming or video or bit rates, we are just trying to measure the network.
I am going to share some stats about the usage. In general, there is always the question: can we measure multiple gigabit per second speeds in the browser? From our observations you can. We have seen multiple results above gigabit per second speeds; the top one we can report reliably, because there were enough samples, is 3.2 gig from some location in Singapore. But of course most of the measurements are below that number because of the capabilities of the devices. We were pretty happy with the adoption and the reach of the test: we cover about 70% on IPv4 and about 25% on IPv6. And country‑wide, we have data from about 70% of the countries with more than 10,000 tests and more than 40% of the countries with more than 100,000 tests.
Given that the UX is very simple and has no dependencies, we assume that is one of the primary reasons why it's very popular on mobile devices: about 50% of the overall test volume is on tablets and mobile. And if you look at those tiny little spikes, those correspond to the weekends, which is probably just natural usage of devices on the weekends: you are further away from the desktop and tend to probe the network with the device in your pocket.
And since we hand out IPv6 and run the test over IPv6 whenever possible, these are some interesting stats on IPv6 adoption as we see it from Fast.com. As we can see, countrywide there is a pretty big ‑‑ overall we haven't seen any country with a meaningful amount of tests above 50% on IPv6.
So to conclude, we have been doing it for about a year. We don't consider this project to be done: we are learning, we are investigating the corner cases and we are constantly asking ourselves whether we have the best method to measure. Overall we are pretty happy with the adoption; it looks like the simplicity of the UX worked out pretty well for word of mouth, and the usage grows 10 to 15% month over month without us doing anything to promote it. But we keep revising the measurement methodology in terms of using a different number of connections and the way we decide to stop the test, and in general we are interested to see how we can collect some of the lower level network metrics from the servers; there could be some interesting insights we could get by correlating the browser and server side of things. It would also be interesting to correlate some of the numbers with third party data, and if anyone has any ideas to integrate with some probe measurement platform or any advanced tools I will be happy to chat more about it.
I am going to move to questions, but before that, let me answer three questions you might have in mind. So we measure download speed, but most of you think: okay, latency, upload, jitter, all this stuff. And I am with you, but the thing is 99% of Internet users do not care about it. At the same time, we don't want to exclude the other 1%, and we are planning to add those advanced features, but we also want to be very careful in terms of the UX. Is there an option to embed the Fast.com test on a third party website? Currently we don't have an option; we are collecting ideas, so if you have usage scenarios please reach out and let us know. And as for the test results, as of right now we don't have a way to give out raw data.
And so I assume I have three or five minutes for questions.
AUDIENCE SPEAKER: Thomas from DFN. The question is, you are probably measuring a vanilla TCP stack; did you already go into trying out QUIC?
SERGEY FEDOROV: The question is whether we played with QUIC for the measurements. Unfortunately, currently we cannot reliably use QUIC from the browser, and on the server side we don't have an implementation of QUIC at the moment. We are well familiar with the protocol and we might want to play with it at some point, but we don't have any data at the moment.
AUDIENCE SPEAKER: Okay.
AUDIENCE SPEAKER: Maxim ‑‑ I went to your site and tried this test. Is it possible from the user side to see whether it is IPv4 or IPv6? Do you plan to implement some tick box for running this test over IPv6?
SERGEY FEDOROV: Currently also no, there is no way to see it in the UI; 90% would not know what IPv4 or IPv6 is. We are planning to introduce advanced features, which will be available for people who care, and we plan to include that as well.
GEOFF HUSTON: APNIC. It's your server that is doing the downloading. What is the TCP flow control algorithm that you are using on your side?
SERGEY FEDOROV: So you are asking which congestion control method we use.
GEOFF HUSTON: The particular algorithm you are using, yes.
SERGEY FEDOROV: We have a very strong transport team and we may run multiple versions in production at any given point, so we use the ‑‑ we experiment with BBR, we have our own implementation of congestion control, and many more that might be in production in parallel.
GEOFF HUSTON: Well, this kind of gets to the heart of things, because part of the issue is that if you have been using a loss‑based congestion control you tend to see a different behaviour than if you are using something like BBR. What are you reporting on: your ability to reach the user with a particular congestion control algorithm, or are you trying to report on what the network is able to sustain as a flow? But you are saying you don't have any particular chosen congestion control algorithm that you use in these tests?
SERGEY FEDOROV: Currently we are working on the TCP optimisations in order to maximise the throughput, and we are playing with multiple implementations. But in general that is a very important point and something that we debate constantly: currently the number that you see represents how much useful information you can achieve over the network, versus how much the network can sustain if you use any possible trick. In many cases, if you keep adding connections you will probably utilise the network more and more, but it's a question of whether that is the normal usage scenario.
GEOFF HUSTON: I have noticed with BBR that is not true. What BBR does is be incredibly aggressive about taking what it can without queuing, knocking everything else out of the way. Not every congestion control algorithm is fair to all the others, which is why I am trying to understand what your measurement is all about: is it what it can achieve by being nice to my neighbours, or by kicking them in the mouth, like BBR does?
SERGEY FEDOROV: Well, in general when you run the speed test it's advisable not to run anything else, so I would say that in this case those two numbers ideally would be the same, but we are not trying to measure the performance of the network when you have traffic competing with ours.
GEOFF HUSTON: Thanks.
CHAIR: Thank you.
(Applause)
So next up is Leslie who is going to talk to us about v6 health with NOMA, it's a really nice restaurant, this is interesting.
LESLIE DAIGLE: It used to be, it's closed now. Hi, so thank you very much for the opportunity to talk about this, and I am going to slide this this way so I can see the comfort monitor over there. What I would like to talk to you about today is some work that I am doing on network operator measurement activity. It's a work in progress, so I would like to expose a bit of the thinking behind it so you can share your thoughts with me, and also a little bit of data, just a taste of data. So the basic principle for this project is that there are operators out there who have actually instrumented their networks, and I will talk more about what I mean by that. And it would be interesting to see if we could get some notion of metrics of the user experience of the Internet from those operators themselves. So this is a different stance to Fast.com, from what we heard just now; it's really more focused on this: if operators who know the layout of their networks and how they are positioned in the overall Internet are involved in making some of these measurements, we can get a different perspective, a complementary perspective, on metrics of the Internet. So what might those be? That is kind of where NOMA comes into the picture. I have drawn this as an iterative cycle because it's computer science and you always iterate, and because it's exploration.
So, last year‑ish John from Comcast stood up at a RIPE meeting and talked about what they have done to instrument the Comcast network, and it really is about doing simple measurements, really simple, from within their network to put together some numbers to describe user experience. So what could we do if more operators did this sort of self instrumentation, and could we then export from that to some neutral third party some notion of those numbers that could be put together as an overall geographic representation of the health of the Internet, focusing first on v4 and v6 because that is timely? The simple data comes from using libcurl from different points within the network towards target websites. In the context of an operator network the target websites are likely to be Facebook and Google, and you then measure both the DNS lookup and the time to connect and so on over v4 and over v6. Really simple stuff, which can then be composed into some kind of insight. If you take the ratios of these two things you get simple numbers back: a number less than one suggests that v4 is actually faster than v6 for that given measurement space, greater than one the contrary, and equal to one means v4 and v6 are performing about the same for that region. The idea would be that you could then have a look and see where v4 or v6 is predominantly successful, and by looking at that over time get some insight into how usable v6 is for end users in different parts of the world.
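The ratio metric described here can be sketched in a few lines. This is a minimal illustration that assumes the ratio is the v4 timing divided by the v6 timing for the same target, which is the only orientation consistent with "less than one suggests v4 is faster"; the function names and the 5% band for "about the same" are hypothetical choices, not part of the NOMA work.

```python
def v4_v6_ratio(v4_ms, v6_ms):
    """Ratio of the IPv4 timing to the IPv6 timing for the same target."""
    if v6_ms <= 0:
        raise ValueError("v6 timing must be positive")
    return v4_ms / v6_ms


def interpret(ratio, band=0.05):
    """Classify a ratio; `band` is an assumed tolerance around 1.0."""
    if ratio < 1.0 - band:
        return "v4 faster"
    if ratio > 1.0 + band:
        return "v6 faster"
    return "about the same"
```

For example, a 50 ms connect time over v4 against 100 ms over v6 gives a ratio of 0.5, which reads as v4 being faster for that measurement.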
So last year the ‑‑ description of what NOMA did: we got together some operators for a workshop, and their eyes absolutely lit up when they saw this is what it looks like when Comcast instruments its network, we would like to do that too. There is information on the slide, you can go read the workshop report and some other material as well. But the hard part of it all was, well, as I am sure you know, good ideas are good ideas, but who has the time to go and implement them and get that going in your network and so on and so forth. So there is a bit of a chicken and egg problem: people think this is an interesting idea, but how to get actual numbers? So, suddenly, I remembered, gosh, there is this little RIPE Atlas probe that has been hanging out in my basement for the better part of ten years earning me lovely credits with RIPE Atlas, why don't I use that to go and do a simulation of what it looks like for different operators? So that is what I have been doing in the last few months. It is not complete yet, but it starts to give a sense of what this would actually look like. I have actual honest to goodness data. So, you can see here, and I am just going to pause for a moment and point out that this is broken down by numbers, which are genuine numbers, to 12 different locations in the US, and these are large city locations. These are not necessarily meaningful in a network topology sense; that is a bug and a feature. It's a bug necessarily driven by the fact that I have no insight into how any networks are actually deployed topologically, and a feature because when you start bringing together data from multiple network operators you can actually stop focusing on their particular topologies and make it about a user experience view of the world, i.e. cities and whatnot. In this case I use 12 locations; you could make it much more granular and scatter about 200 or 250 locations without any problem.
So, what that data actually was, was data from RIPE Atlas probes, showing the total execution time, although the DNS resolution in that particular run I didn't do on the probe, which is a mistake; it was timed first ‑‑ time for the connection. It was for two networks: I looked at the RIPE Atlas probes separately in the Comcast network and the Charter network, and then what I did with the results was simply, based on the lat/long in the probe description, to allocate each probe to its closest geographic location of the dozen that I outlined earlier. And I averaged that, and I should also point out that in that particular measurement everything was measuring to one RIPE Atlas anchor, which happened to be located in western Virginia, which is kind of sort of central, at least in some people's minds, in the US; it was just one place to pick. So, what does that say? Well, not really a whole lot, because it was only one data run. But it does actually start to highlight some interesting things if you start looking at the world through this ratio perspective. It suggests that the best place for v6 in the US is somewhere near Eugene, Oregon, and it suggests things are kind of hurting in Boston; I don't know really what was going on there, but their v4 is really in trouble. But even just looking at the data really simply, one of the first things to walk away with is that the v4:v6 ratio is kind of interesting and gives you a quick rule‑of‑thumb metric of how things are, relatively, between v4 and v6, but it hides stuff as well. It hides, for instance, whether there are significant differences, like a 300 millisecond difference: do you have a Happy Eyeballs problem? So it turns out that the v4:v6 difference is also interesting to look at. If you look at the differences you can see two places, Denver and Nashville, that have probably the same v4:v6 ratio but a bigger difference in terms of the relative difference in their timings.
So, that is a whole lot of words about numbers on a screen, but hopefully it gives you a sense of the kind of questions you can start asking yourself when you have got that data averaged across a bunch of measurements.
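The processing described here, assigning each probe to the closest of a fixed set of cities by latitude and longitude, averaging per city and then reporting both the v4:v6 ratio and the v4 minus v6 difference, can be sketched as below. This is an illustration under stated assumptions, not the speaker's script: the three‑city subset, the approximate coordinates, the great‑circle (haversine) distance and the function names are all choices made for the example, and it processes whatever timings it is given rather than any data from the talk.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt
from statistics import mean

# Illustrative subset of city centres (approximate lat, lon); the talk
# used a dozen US cities that are not all named in the transcript.
CITIES = {
    "Boston": (42.36, -71.06),
    "Denver": (39.74, -104.99),
    "Nashville": (36.16, -86.78),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def nearest_city(lat, lon, cities=CITIES):
    """Name of the city closest to the given coordinates."""
    return min(cities, key=lambda name: haversine_km((lat, lon), cities[name]))


def summarise(probes, cities=CITIES):
    """Per-city v4:v6 ratio and difference.

    `probes` is an iterable of (lat, lon, v4_ms, v6_ms) tuples. Timings
    are averaged per city first, then combined.
    """
    grouped = defaultdict(list)
    for lat, lon, v4_ms, v6_ms in probes:
        grouped[nearest_city(lat, lon, cities)].append((v4_ms, v6_ms))
    summary = {}
    for city, rows in grouped.items():
        v4 = mean([row[0] for row in rows])
        v6 = mean([row[1] for row in rows])
        summary[city] = {"ratio": v4 / v6, "difference_ms": v4 - v6}
    return summary
```

Keeping the difference next to the ratio makes the point from the talk visible in the output: two cities can share a ratio while one of them has a much larger absolute gap in milliseconds.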
So this is fun. So, what next? I keep slicing and dicing: I am programming more RIPE Atlas tests, to do the v4:v6 ratio and difference, looking at more geographies and targets and networks, because doing just one is not all that interesting, and looking at doing it over time so you can see where the dips and valleys are in terms of the actual metrics. And also doing the DNS resolution on the probe instead of in the Atlas infrastructure. So this is going to be more data and graphs and fun and excellent, but it's still just playing. It's really super excellent to have the RIPE Atlas infrastructure, to be able to do this with real live network data from around the globe, but it's still only a simulation, and some of the shortcomings include the fact that the probe coverage in these networks is not as good as one would expect if an operator was instrumenting their own network. For instance, here is the RIPE Atlas plot of where the probes are said to be in the first network; as you can see they are kind of scattered and clustered in different geolocations, and it's even more scattered in the second network, and this one has, I think, like 30 probes, most of which were responsive in this network. So that is a problem. And so what you'd really prefer to have, if you were looking at your own network and trying to figure out what was going on, is complete coverage of your access networks, such that you get a sense of how each of your end users is experiencing the Internet services that they are interested in. So you'd really rather have this measurement being done not in the home, so you are not necessarily measuring the last mile, but maybe at the closest layer 3 device to the home, for instance.
And as I said, this was using RIPE Atlas anchors; you have to use RIPE Atlas anchors for HTTP measurements in the RIPE Atlas infrastructure, for good reasons, but a network instrumenting its own space can readily measure towards real websites to get a real sense of what users are experiencing. And that starts to be really important from the standpoint of the kinds of things that Comcast has gotten out of pursuing this in their own networks. It becomes important to use local DNS so that you use the service's geolocation. So if, for some reason, based on its geolocation Facebook is sending all of your network's Boston customers to the Facebook data servers in California, that is how your users are experiencing things and that is what you should be measuring. And by the way, that is a point at which you might want to pick up the phone and say, can we have a talk about how you're geolocating my users? That is also a clue to what operators would like to get out of this kind of instrumentation, that is what is in it for them: another perspective on understanding how customers are experiencing their network.
So, this NOMA thing: I am putting it together as a collaborative industry activity, not only to foster doing those kinds of measurements within networks but also to get some shared data so we can get a shared perspective and over time evolve a sense of how, in the first instance, v4 and v6 health is progressing. The intended outcome is an actual measurement of Internet stability as an open resource for people to see. I think it would be useful not just for those of us who like to come to these meetings and ask how is the Internet today anyway, but also for people who are building up networks in developing regions, to have a sense of what v4 and v6 performance looks like in the rest of the world: what should I consider success in terms of building out my own network? And frankly, at the end of the day hopefully we can encourage more networks to be objectively introspective in a way that is useful not only to them but also in this collective experience that we all have.
So I guess the takeaways I would like you to have from this are that the v4:v6 ratio and difference are interesting metrics for looking at IPv6 Internet health, and also that it's useful to have the information available publicly. As another example, if you have a look at the v6 launch.org, the website for the IPv6 launch activities that was, what, five years ago now, that site is still showing statistics of different networks' relative percentage of IPv6 traffic towards major content providers, and that is still a really firm resource for network operators to this day to show how well their own v6 coverage is doing. So having this kind of globally, publicly accessible metric information available is useful to the world at large, so hopefully we can build that.
So, if you are an operator, think self instrumentation would be useful, want to talk some more I would be delighted to talk to anybody, I will be hacking away with some RIPE Atlas fun.
(Applause)
CHAIR: Any questions?
AUDIENCE SPEAKER: Shane Kerr from Oracle. I think the idea of having different operators collect the data and put it in some place where researchers and other people can look at it ‑‑ was your intention more for researchers to access this or for operators?
LESLIE DAIGLE: The overall intention is to publish the metrics, so probably not all of the raw data, but publish the metrics.
SHANE KERR: I see. Well, just one suggestion: DNS‑OARC has been collecting contributed data and giving access to members for several years now, many years now, and it's actually really, really useful for operators and researchers in that area, so it's a good effort and I look forward to seeing where you end up with it.
LESLIE DAIGLE: Thanks.
AUDIENCE SPEAKER: From BIX. I have a technical suggestion. You showed pure ratio values; I would suggest showing the logarithmic value of the ratio, because one will be zero, two will be one and ‑‑ for comparing the values in the positive and negative direction, it would be centred on zero, because two is the same as 0.5 in the other direction, and if you show the logarithmic value it would be minus one and plus one.
LESLIE DAIGLE: Okay.
AUDIENCE SPEAKER: You could use any base ‑‑ natural base or ten base.
LESLIE DAIGLE: I will keep that in mind, thanks.
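The suggestion from the floor can be illustrated with a base‑2 logarithm, which is the base for which the numbers quoted (two maps to plus one, 0.5 to minus one) come out exactly. This is a minimal sketch; the function name is hypothetical, and as the audience member notes any base gives the same symmetry around zero.

```python
from math import log2


def log_ratio(v4_ms, v6_ms):
    """Base-2 logarithm of the v4:v6 timing ratio.

    Zero means parity; a factor of two in either direction gives +1 or -1.
    """
    return log2(v4_ms / v6_ms)
```

On a plain ratio scale, "twice as slow" (2.0) and "twice as fast" (0.5) sit at different distances from parity (1.0); on the log scale they are the same distance from zero, which makes charts easier to read in both directions.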
CHAIR: Great. Thank you.
(Applause)
Next we have Robert. Going to give us some updates on Atlas.
ROBERT KISTELEKI: English to follow... Welcome everyone. I am working for the RIPE NCC, R&D department, and I am going to give you the current news about RIPE Atlas.
It would be nice to have the slides. Who does not know what RIPE Atlas is? There is one. Daniel, I can tell you privately. So you may recall that the last time I stood up here I reported that we had a bit of a dip in terms of the active probes that we have in the network; that was roughly half a year ago. I am happy to report that we have recovered from that, as a result of a combination of two things: we worked on an enhanced firmware that now uses the USB stick in the version 3 probes less often, which makes them more stable, and we also reached out to the people who had probes which were down. The combination of the two resulted, in a month or so's time, in an increase of about 300 or 400 active probes, so right now we are floating around 9,700 probes plus, so we are rapidly reaching 10,000, which is a really nice goal to have. We are covering about 3,400 in IPv4 space and about 1,200 in IPv6, so that is a decent coverage for stable probes that we can have. And last time I was mentioned in the secret Working Group because of mentioning this number: we are collecting roughly 4,500 results every second, so that is almost 400 million data points a day, and we can retrieve any one of them within a second. Kudos to the team. We have 380 or so RIPE Atlas ambassadors including staff and I would like to take the opportunity to thank you, because you really help extend the network, and please carry on doing it. Twitter followers are up and to the right, mailing list members are up and to the right as well. So far this year we have two committed sponsors and we have three more in the making, but this is roughly half of the year, so we imagine and we hope that we are going to have more, and if you would like to sponsor, talk to us, we would love to hear from you.
Some recent use cases. I am not going to go through all of them, but for example, the B root operator used RIPE Atlas to measure the effect of switching from Unicast to Anycast: they have two locations and you can see how many probes gravitated to the new instance from the single one. We made some measurements ourselves and published a report about the leap second effect, so there was one around new year and interestingly enough it did have an effect even on the routing system, believe it or not. So go and read the details if you want. Finally, the last one is a very interesting study, which I found useful, and which uses the metadata that we have, so it's not the measurements themselves but what the probes do, and in particular when they connect and when they disconnect. The researchers looked at that and tried to draw some conclusions about which providers do DHCP in a way that gives you a different address if you connect again. It's very interesting, go check it out.
Anchors. If you run a RIPE Atlas anchor that means you are participating as a RIPE Atlas probe but you are also fine with receiving traffic from other measurements, so the RIPE Atlas anchors act as DNS servers, HTTP servers and so on. At the moment we have a bit more than 250 of them up and running. You can see the distribution, and on the bottom right graph you can see that the growth of the RIPE Atlas anchors specifically is pretty linear. We are happy to see that and would like to thank all of you who helped, in particular our partners who helped with the distribution of the RIPE Atlas anchors in their regions.
Probes. We are looking for a potential version 4 probe, v4, although we might just jump to v6 right away and skip v4. And on a similar note, we seem to be in a very interesting position: we never thought that the version 1 and 2 probes would live so long. The project has been going on for six‑and‑a‑half years now and these guys are still going strong. Except that we seem to be pushing the version 1s over the limit, so if your version 1 probe used to be fine and it's going down nowadays, we probably know the reason: we started pushing them over what they can do. We are going to fix that by reducing what they do a bit and then freezing them in place, so they are not going to get new features in the probe firmware. If we introduce new kinds of measurements later on they will not get them, but we will keep on supporting them with security updates, so they are not going to be dead IoT devices, don't worry. We are looking at whether we should do virtual probes; virtual probes basically mean that someone else provides the hardware and we are running in a VM instance. The benefit is, we hear every now and then from operators and people from all over the world that installing a physical device in their network is just a no‑go, that is just not going to happen, but they are very happy to give us a VM; half a gig of memory, a gig, that is nothing nowadays, and that will be fine with them. So yes, that gives the potential to extend the network further. But it has a couple of drawbacks from the operational point of view, reliability and so on, so we have to carefully evaluate those to see if this is actually a good step or not. And we are going to come back to the community with our proposal very soon, and perhaps even take it to the next level and make virtual anchors as well, who knows.
In other news, what else did we do? We have introduced so‑called probe stability tags. These are system tags that the infrastructure itself assigns to the probes if we observe that they are behaving consistently and well enough, in particular in v4 and v6 connectivity and measurements. The exact definition of what this means is published in a RIPE Labs article; if you are interested I encourage you to look at that, but basically if your probe behaves well it will receive stable for one day, stable for one week, stable for a month or so tags. DNS root zone measurements: we have introduced new ones, so‑called built‑ins, so they are running on all the probes, and what they do is basically send queries to the root DNS and somewhat simulate what users experience when they actually look up stuff in their daily lives. The benefit of this is that if the root system is under attack, we will have a good enough data set, or so we believe, to show whether this actually affected users or not. So was it just an attack on the root system? It may or may not have been successful or partially successful, but according to RIPE Atlas, were the users affected? Because if they were not, then that is less of a problem than if they were.
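As an illustration of the stability tags described above, here is a minimal sketch in Python. The talk defers the exact definition to the Labs article, so the tag names, the windows and the 95% connectivity threshold below are assumptions made for the example, not the rule RIPE Atlas applies.

```python
from datetime import timedelta

# Hypothetical tag names and windows; the real definition is in the Labs article.
WINDOWS = {
    "stable-1d": timedelta(days=1),
    "stable-7d": timedelta(days=7),
    "stable-30d": timedelta(days=30),
}
MIN_CONNECTED_FRACTION = 0.95  # assumed threshold, for illustration only


def stability_tags(connected_seconds):
    """Return the tags a probe would earn.

    connected_seconds maps a tag name to the number of seconds the probe
    was connected during that tag's window.
    """
    tags = []
    for tag, window in WINDOWS.items():
        connected = connected_seconds.get(tag, 0)
        if connected / window.total_seconds() >= MIN_CONNECTED_FRACTION:
            tags.append(tag)
    return tags
```

A probe that was connected for the whole last day but only about 83% of the last week would receive just the one‑day tag under these assumed thresholds.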
We are thinking about something that doesn't really have a stable term yet; I coined it Cloud reachability. What we hear is that there are a bunch of folk out there who would like to be measured and they are running their services in the Cloud, so they have servers or VMs running in Amazon or Google and so forth, and it would not really make sense for all of them to be measured separately: imagine that hundreds of institutions want to be measured and they are all in Amazon island. It would be enough to measure Amazon island, and all the people who have instances there would have at least basic reachability measurements towards their infrastructure. So we are thinking about that, and there are some members and people in the crowd who seem to be supporting this at least notionally, so we will look at that.
We also had a DNS measurement hackathon very recently, and I understand that Vesna is going to give an update about that in the DNS Working Group so if you are interested in DNS and measurements and what hackers can do about those, I encourage you to go there and listen to the talk.
We have a shiny new tool called TraceMon and I am not going to go into the details because the presentation after me is going to be way more precise than I can ever be. It is a really cool tool and you want to use it, so please stay around and listen to Massimo's talk.
This is almost a vanilla slide, not much has changed here. What did change is that we took actual steps to put this into production, and we are cooperating with the former TERENA to start using this, so this is in kind of a pilot phase. I think there are a handful, like five or ten, probes doing measurements already, and we will see if this is a good thing and whether we should extend it or not, but it's definitely an opt‑in thing, so don't worry, your probes will not just suddenly do wi‑fi measurements without your knowledge. OpenIPMap: this is a project that was initiated back in the day by our colleague Emile Aben. It seems to be a very good idea, we gather from your feedback, so we are taking it to the next level and you can expect a production release very soon now, and we will continue working on it throughout the year to make it even better. This is just a screenshot of what it is going to look like, and it's going to be cool.
So that is about it. Do you have any questions?
AUDIENCE SPEAKER: SIS net, I would like to ask about RIPE Atlas anchor replacement, and especially whether there is some new hardware plan, because the manufacturing of ‑‑ it doesn't look good with them, so if you are considering some replacement ‑‑
ROBERT KISTELEKI: Excellent question, thank you. The short answer is yes. The slightly longer answer is that this was a surprise not only to us but even to the supplier of the Soekris boxes ‑‑ the good news is our supplier has a stock of, I don't know, 50 or 60 anchors and they basically dedicated that to us, to the RIPE Atlas anchoring project, so we seem to have a steady supply for the next half year or so. That said, the infrastructure team is already looking at options to be used as next‑generation RIPE Atlas anchor hardware. They don't know what it's going to be, they have just started with that, and this does not affect the currently running anchors. Eventually they will come up with lifecycle replacement, of course, and we expect that by that time we will have more information and candidate hardware, or hopefully actual chosen hardware, for the next generation.
AUDIENCE SPEAKER: Standing in for Andrei, who is not in this room: there is some device from CZ.NIC that ‑‑
ROBERT KISTELEKI: Indeed, our infrastructure team is aware of that and they are looking at it as well. And they want to talk to you in the hallways.
BRIAN NISBET: HEAnet. So, first off, that is good news on the V1 probes, I will try and rehabilitate probe 195 and get it back in operation.
ROBERT KISTELEKI: We will put them on life‑support so if they seem to be dead we will resurrect them.
BRIAN NISBET: And the other thing is, how, if the wireless monitoring is opt‑in, how does one opt‑in?
ROBERT KISTELEKI: We have the feature, it's not rolled out on the UI just yet but it's a click of a button. We want to roll it out once we are satisfied that the feature is actually working, or the wi‑fi measurement I should say is working, the way it is intended. For the moment we would like to avoid people jumping in and opting into something that may be flaky.
BRIAN NISBET: Okay. Well to say we will ‑‑
ROBERT KISTELEKI: We will put the word
BRIAN NISBET: We will be very interested in opting in when it's available.
RANDY BUSH: IIJ. I might throw this out because other people may feel differently so I just want to poke the pig. When it comes to a new anchor, the cost of the anchor box is less than half ‑‑ significantly less than half of what it costs us to put one in. Okay. So, you know, get a good box.
ROBERT KISTELEKI: Understood.
RANDY BUSH: Thank you.
ROBERT KISTELEKI: Thank you very much.
(Applause)
CHAIR: Thank you, Robert. And next up, as already announced, is Massimo Candela with TraceMon.
MASSIMO CANDELA: So, good afternoon, I am from the research and development department of the RIPE NCC. As a kind of tradition for RIPE meetings, also this year I would like to show you a new tool. It's called TraceMon and it's about ‑‑ so I tried to find some goals and I called them daily struggles. I divided them into two cases. A reaches B: if that happens we are happy and can go have lunch. But sometimes maybe we are interested to know how: is it optimised, is it peering through the IXP ‑‑ is it passing through the IXP we just started peering with, which autonomous systems are involved, what is the latency between A and B, and where, so which entity, which node of the CDN have we reached, from which source of the trace route? And a question that is common: if A and B are both in the same country, is the traffic going out? We know this question; we have a tool to answer it.
The second case is A doesn't reach B, so where does it stop? Which autonomous system, the geographical location, who is involved? Which portion of the network? And if the trace route stops in a wild card, as happens with trace route, how can we start troubleshooting and contact someone, because we don't know anything about it? This is another question that we often have, so: how to get contacts. And of course we do all these measurements at the IP level and we would like to know why that is happening, what is going on at the BGP level.
So, a good answer is to use trace route, and what better than using Atlas? I am not going to bother you with Atlas, but you can do multi‑source trace routes, so: select these hardware devices and every 60 seconds do a trace route to my target. That is really nice, but as the dog said, too much text; it's a lot of text and probably ‑‑ so why do we have ‑‑ we have tools for ping, we have them for BGP, but there are not so many for trace route? Because trace routes are complex, it's a complex model: they have a lot of anomalies, it's difficult to identify what is a node in a trace route, a single one, and there is also a lot of data, so we have to find a way to filter, to simplify that, and I mean, who are we to just throw away some part of the information? So it's a complex model and a complex view. You can try to use some of these generic graph tools where you can do graphs of pears and apples: you gather your JSON and put it into the tool and get the graph out. It's too much work and no one is going to do it daily, and also you don't have the evolution of the topology and you cannot drill down into the information. So, TraceMon. TraceMon is a web application, so you can run it in your browser; it just gets trace routes as input, and you can visualise multi‑source trace routes with it. It tries to infer the network topology and the characteristics of all the elements, the items in the network, so the ones reached by the trace routes. And it uses a lot of data sources, so actually we have for now something like 20 data APIs; it gives you one‑click access to a set of information that can be used for day‑to‑day operations.
We are going to see them later.
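To make the idea of inferring a topology from multi‑source trace routes concrete, here is a minimal sketch in Python. It is not TraceMon's implementation; the function name, the input shape and the placeholder naming for unanswered hops are assumptions made for the example.

```python
from collections import defaultdict


def build_graph(traceroutes):
    """Merge several trace routes towards one target into a single graph.

    traceroutes maps a source name to its list of hop IP addresses, in
    path order; a hop that did not answer ("*") is given as None.
    Returns a dict mapping each node to the set of nodes that follow it.
    """
    edges = defaultdict(set)
    for source, hops in traceroutes.items():
        previous = source
        for index, hop in enumerate(hops):
            # An unanswered hop cannot be merged with hops from other
            # paths, so it gets a placeholder unique to this path.
            node = hop if hop is not None else f"*:{source}:{index}"
            edges[previous].add(node)
            previous = node
    return edges
```

Hops that appear in several trace routes collapse into one node, which is what turns a long text listing into a tree converging on the target.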
This is the main view of TraceMon. In the centre we have the graph. In the graph, essentially, you see at the top there are these green nodes; the green nodes at the top are the probes, so the sources of the trace routes, and each of these grey lines is a single trace route to the target. In this case the target is dot CA; the target is here at the bottom, it's orange and red. So each of them is a trace route and each of the nodes that you see, the dots, is an IP address. In this layout they are by default annotated with autonomous system numbers, so we do an IP to AS lookup, and for some of the, let's say, common or important autonomous systems we have short names, so for example we don't have to say Google blah‑blah, we just say Google; it makes it easier to read, and for the rest you have the autonomous system numbers. So the yellow ones, as I said, are IP addresses in the trace route, but there are some blue nodes and these are IXPs. This tool detects automatically all the IXPs that are traversed by the trace routes; we just use the PeeringDB information, so it's really important to update your PeeringDB record, and we match the peering LAN, and when we find it is an IXP we provide all the information about the IXP and it's a blue node. Plus you have some grey nodes like this; this is a wild card, so essentially when we do the trace route we don't have any answer from the node, and there are other nodes like this that are private IP addresses. At the bottom here, this chart gives you the round trip time from that source to reach the target over time. In that chart it can sometimes happen that you have this kind of red spot here; this means you had packet loss. In this case it's a complete packet loss; the target is RIPE NCC, well, it was not a production service, just an experiment, so don't fire me.
And you see here, this part is when basically we have the complete packet loss, so you can use this cursor here and click in the centre of this, and the graph on top will reflect the topology at that moment, the moment where you clicked in the time‑line. As you can see, we are not able to reach the target any more, shown with this dashed thing. If you don't like the annotation with the autonomous systems, maybe you want to click on one of these options and get the reverse lookup, like in the image here, or you can get the country code. The country codes are also based on a project we have called OpenIPMap; it's a crowd‑sourced database for geolocation. If you click on a path, on a trace route, you will get the text format of the trace route, so you will get the thing that you get from the shell, including Reverse‑DNS. While if you click on a single node, you get a panel with all the information. In this case we can click on, for instance, an IX, and we have the name and IP address and the round trip time, I forgot to put it, the location, the PeeringDB information, the routing information, if we are able to see it in the Routing Information Service of the RIPE NCC, plus the registry information. That is with only one click. You have this set of blue buttons that I am going to talk about later. Here instead we just clicked on a wild card; that is a bit weird, because usually we cannot get any information. This tool tries to guess private IP addresses and wild cards, so basically it uses all the trace routes we have in the measurement for that target and tries to give a guess, and I call it best guess, so it's still a guess, about that node. In this way we can keep showing these buttons at the bottom; they provide this set of information. So, for example, if you click on update location you will get this panel where you can crowd source your information.
It doesn't mean you are going to change the location to whatever you want; it's just crowd‑sourced information that we are going to take into consideration. You can get routing information, so you can get the BGPlay view for that resource in the selected time window, and RIPE Database information, plus the contacts: just click on contact holder and you get the tech‑c plus abuse‑c information with one click. So you can load trace routes and you can show a lot of trace routes, but sometimes you want to focus on some of them, and what you can do is, in search and focus, you can just start typing something; the auto complete is going to give you suggestions about the elements involved, so you can just type an autonomous system, for example Level 3, and this is going to be focused on Level 3 and the other parts almost disappear. So you can put this AS and another, or this probe, and things like this, a bit more advanced. You also have things like outcome, reached or not reached. So this was a lot of fun to implement, a feature I call network annotations; it's prototype research going on. So essentially in this case it's a measurement to Akamai, and TraceMon is able to understand that different probes resolve the DNS name to different IP addresses, and visually we actually have different trees. At the same time, some of these targets are in the Akamai autonomous system and the tool tries to annotate these with CDN, while instead some of the targets, like this one, are not in the Akamai autonomous system and it tries to suggest local cache. Plus, when a probe is not able to issue the measurement any more, there is this red triangle you can click to get the log of the error: DNS failed to resolve here.
Of course, I put in the replay history so you can basically have the animation of the graph over time: just press play at the top here, and the trace routes come in, and when for the same source and the same target there is a change in the trace route output, this is going to be reflected in the path with a path change. When they disappear it is because the probe has not been issuing any more trace route results for some time, so I basically consider it disconnected.
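The node types described earlier in this talk (IXP nodes found by matching the peering LAN, wild cards, private addresses) can be sketched in a few lines of Python. This is an illustration only: the IXP names and prefixes are made up, using documentation ranges in place of real PeeringDB data, and the function name is an assumption.

```python
import ipaddress

# Hypothetical peering LANs (documentation prefixes), standing in for
# the prefixes that would come from PeeringDB.
IXP_PEERING_LANS = {
    "EXAMPLE-IX": ipaddress.ip_network("203.0.113.0/24"),
    "SAMPLE-IX": ipaddress.ip_network("2001:db8:100::/64"),
}


def classify_hop(hop):
    """Classify one trace route hop; hop is an IP string, or None for '*'."""
    if hop is None:
        return "wildcard"
    address = ipaddress.ip_address(hop)
    # Check peering LANs first: a hop inside one is an IXP node.
    for name, lan in IXP_PEERING_LANS.items():
        if address.version == lan.version and address in lan:
            return f"ixp:{name}"
    if address.is_private:
        return "private"
    return "public"
```

The peering LAN check runs before the private address check on purpose, so that a hop inside a known peering LAN is always reported as an IXP.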
So just to close this. TraceMon is open in the sense that it's OpenSource and you can get the code on Git, so please help me to develop this, contribute code and suggest features. There are a lot of open research topics; I divided them into these three sections: network simplification, to simplify and aggregate the data we have; network characterisation, to get more insight about the various components involved; and visualisation, to visualise in a meaningful way, and it contains a lot of algorithmic research problems. It's open to other data sets, including private ones; for now it supports the Atlas format, but it's just JSON, and we want to create a generic format also. And if you are a university working on Anycast, we can collaborate and put something visually on TraceMon.
Upcoming features: one of the most requested is autonomous system grouping. You saw in the path that the various IP addresses are annotated with the autonomous systems and some of these are repeated. So the idea is that we squash them, when it's the same autonomous system, into one single node, so you would get a kind of BGP graph made with trace routes. Or also other flexible groupings that you may want. Realtime monitoring: this is already implemented, it uses the RIPE Atlas streaming based on web sockets, but for now it's not really visually appealing, which is why it's not in production. Alias resolution. Path colouring, so, for example, round trip time colouring, or I would like to colour something like: this is the part of my network, this is the wild Internet and this is the target network. Anomaly detection: I am collaborating on an integration with the delay and forwarding anomalies research of Romain, Christian, Randy and Emile, and I would like to introduce auto filtering to focus and filter on what we consider interesting compared to historic behaviour, and I am collaborating on the periodic trace route research.
So, that is all. And thank you for your attention. If you have questions, I am going to answer.
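The autonomous system grouping described above, squashing consecutive hops in the same autonomous system into a single node, can be sketched as follows. This is a minimal illustration in Python, not the planned TraceMon feature; the function name and the choice to keep hops with an unknown AS as separate nodes are assumptions.

```python
from itertools import groupby


def group_by_asn(hops):
    """Collapse consecutive hops that belong to the same AS.

    hops is a list of (ip, asn) tuples in path order; asn is None when
    the IP to AS lookup gave no answer (wild cards, private addresses).
    Returns a list of (asn, [ips]) nodes.
    """
    grouped = []
    for asn, run in groupby(hops, key=lambda hop: hop[1]):
        ips = [ip for ip, _ in run]
        if asn is None:
            # Unknown hops are not merged: they may belong to different networks.
            grouped.extend((None, [ip]) for ip in ips)
        else:
            grouped.append((asn, ips))
    return grouped
```

Only consecutive hops are merged, so a path that leaves an AS and re‑enters it later still shows two nodes for that AS.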
(Applause)
AUDIENCE SPEAKER: Hi. So, Christian Kaufmann. First of all, I like it a lot, it's a very cool and sexy tool, so thanks for that. Two questions, and they came up when you showed the Akamai example. How does the system choose the probes? Because you showed a certain amount, you had five, six, sevenish. Do I choose them on a geographic basis, or does the system choose them randomly?
MASSIMO CANDELA: It is tightly coupled with RIPE Atlas: this is a measurement you have already created, and when you create it you select the probes, and you select them based on geographical region or whatever you want. With the tool by default, 1,000 trace routes is not going to scale; I mean, if I don't aggregate, whatever I am going to visualise with lines, 1,000 is not going to work. So what the tool wants to do, but for now it is still open and not implemented, is to have a smart way of sub‑selecting these probes: imagine you create a measurement with 1,000 probes and I select, I don't know, ten of them, distributed across various geographical regions, so after you have created it I pick some of the ones you used to create it. That is future development; we don't have it for now. For now it's whatever you create in Atlas.
AUDIENCE SPEAKER: Okay. And I have another one. So you are giving the software on GitHub and I can install it myself. Is there actually a version of TraceMon on the RIPE web page, so that I don't have to install it and it is publicly available?
MASSIMO CANDELA: Yes, we have it on Atlas: when you create a trace route you go to the measurement list and just open the trace route measurement that you created, and you have a tab called TraceMon; you click on that and you can visualise your ‑‑ this is a widget also, so it means you don't even need to download the code. If you want it on your monitor in your data centre, there is documentation: you can find three lines of JavaScript code that grab the code from our server and run it in your browser. But if you want to change it, to modify it and use your own data, you can get the code.
AUDIENCE SPEAKER: Thanks a lot.
DANIEL KARRENBERG: With the RIPE NCC and I do know about Atlas. Thank you. It's a really cool tool, and you guys just suffer from attacking the hard problems first. Because I think you said like if I have 1,000 trace routes I cannot visualise them because I have to aggregate or select or something like that. But it might be a good idea to just limit the amount of hops that you show from the target because sometimes I am really interested in how it is close to the target so if you had a knob that just said, show me the last three hops before it hits the target you could actually visualise 1,000 trace routes. So, why don't you just implement the easy features first.
MASSIMO CANDELA: Thank you for your question. So actually that feature is implemented, but in the HTML it is commented out. I can set the number of hops from the target; the only problem is that I got feedback that people also want to know the source of the trace route at the same time. It's a prototype, so I had to select a set of features to implement, and the thing is, for now, since I didn't find in time a way to visualise the source of the trace route without cluttering the graph too much, I commented it out, but I can let you try it.
DANIEL KARRENBERG: I am a user who doesn't need to see the source. Please make one that just shows the last few hops.
MASSIMO CANDELA: I will. It's just a ‑‑ it's going to be developed more and more.
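The suggestion in this exchange, showing only the last few hops before the target, can be sketched as follows. It is an illustration in Python with assumed names, not the feature that is commented out in TraceMon.

```python
def distinct_tails(traceroutes, count=3):
    """Return the distinct final `count` hops across many trace routes.

    traceroutes is an iterable of hop lists, each ending at the target.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return {tuple(hops[-count:]) for hops in traceroutes}
```

Paths from many sources tend to converge close to the target, so the number of distinct tails is usually much smaller than the number of trace routes, which is why such a view could stay readable even with 1,000 sources.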
AUDIENCE SPEAKER: Alexander. You see I really like the tool you are presenting, it looks really sexy but I am sorry to say it but it's not working on the website.
MASSIMO CANDELA: It's not working on?
AUDIENCE SPEAKER: Just ‑‑
MASSIMO CANDELA: For now it doesn't support any version of Internet Explorer.
AUDIENCE SPEAKER: It's Chrome.
MASSIMO CANDELA: We can check later. Is it integrated in Atlas or where are you watching it?
AUDIENCE SPEAKER: On your website.
MASSIMO CANDELA: On Atlas.
CHAIR: Thank you. Thank you very much.
(Applause)
And finally, we have Alexander, who is going to give us some examples from a Russian hackathon.
ALEXANDER ISAVNIN: Russian hackers have become a popular topic lately ‑‑ who hacked elections in your country? This is wrong ‑‑ find a teenager in your country. So, actually, as you may know, and you may even have seen in a presentation today, the RIPE NCC organises hackathons related to measurements twice a year, but not in our region. So we decided to organise some in our country for our people, mostly just to have fun. And, it's not stated here, Russian regulators try to regulate the Russian Internet, but they don't know it well, and the community does not know well what the regulators are trying to regulate. So we had to try to measure and have a qualified response to that. But first of all, having fun.
We have already had two hackathons, short ones, with another one to come, done in a relaxed way: we do not hack overnight. Organisation is by volunteers. The first two hackathons were in Moscow at a community centre we have ‑‑ working for innovations, called Boiling Point. We do not have a lot of sponsors so we are just buying some coffee and tea because, you know, and I have some with me; during the BoF, which will be tonight, we will test it. And we have really low budgets: the first hackathon cost about €50, the second one about €30, because we learned how to do things in more economical ways. So if somebody in this audience, in the session after next, tells you that running ‑‑ is very expensive, just say to them that they are wrong.
Well, so about the hackathons. The first hackathon had about 30 registrations, and we nearly finished 2.5 projects. One finished project was about comparing Russian open data with RIPE NCC data: in Russia, ISPs have to have a licence, so ‑‑ from the RIPE NCC, who supported this, tried to compare the available data on Russian ISP licences with the list of LIRs.
Also our hackathon is a fun community, it's not strongly related to programming, so we had a project on data journalism, not exactly related to programming. It's called critical infrastructure of the Russian Internet, an analysis based on the European approach. Those of you from Germany know they have a kind of legislation on critical infrastructure. Actually Russian legislators are also trying to introduce regulation with the same name, so some people tried to understand what organisations and infrastructure would fall under the regulation of critical infrastructure if we used the German approach in Russia. So it was kind of an interesting project. It's not a programming task, so there is not much to present. I can show you some slides in the local language; you can read them in the presentation if you want. The second ‑‑ it was really a programming task: people who had never used RIPE Atlas or RIPEstat before tried to understand what the borders of the Russian Internet are, also by comparing RIPEstat data and RIPE Atlas data.
Actually, formalising this, it was comparing routing announcement data from RIPEstat and trace route data from Atlas. And we saw that actually not much of the data correlates, so it's an interesting task to study later.
Second hackathon: a few more registrations, but by the second day also two‑and‑a‑half projects survived. The interesting outcome was a study of IPv4 space in Russia, how much is left, and we saw that three‑quarters of a /8 of IPv4 space is registered in Russia but not announced. Somebody keeps it in stock. It's not finished, it did not go to the final presentation, but the guy who worked on this is here at this conference.
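For scale, a /8 holds 2^24 = 16,777,216 addresses, so three‑quarters of a /8 is 12,582,912 addresses. The kind of comparison described above, registered space against announced space, can be sketched in Python as follows. This is an illustration, not the hackathon project's code; the function name and the input format (lists of prefix strings) are assumptions.

```python
import ipaddress


def unannounced_addresses(registered, announced):
    """Count addresses in registered prefixes not covered by any announcement."""
    announced_nets = [ipaddress.ip_network(prefix) for prefix in announced]
    total = 0
    for prefix in registered:
        remaining = [ipaddress.ip_network(prefix)]
        for ann in announced_nets:
            still_uncovered = []
            for net in remaining:
                if net.version != ann.version:
                    still_uncovered.append(net)
                elif net.subnet_of(ann):
                    continue  # fully covered by this announcement
                elif ann.subnet_of(net):
                    # Partly covered: keep only what lies outside the announcement.
                    still_uncovered.extend(net.address_exclude(ann))
                else:
                    still_uncovered.append(net)
            remaining = still_uncovered
        total += sum(net.num_addresses for net in remaining)
    return total
```

The sketch assumes the registered prefixes do not overlap each other; overlapping input would be counted twice.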
The second one was about ‑‑ we tried to understand how infrastructure, well, as you have seen, Russia is a long country, and how infrastructure is distributed across this country. We tried to use geolocation but the precision is very poor.
There were a lot of slides; I will show only some of them from that presentation. You see, the IPv4 prefixes registered in Russia are somehow distributed like this, but IPv6 is located only in one place, and we checked it on the map, so if you like extreme tourism you can try to find where all the IPv6 in Russia is, in Siberia.
So, the second project of the second hackathon was actually from ‑‑ he already knew what Atlas is and had used it, but during the hackathon he discovered some features that were new to him, and also while working on this project he discovered a bug in the Atlas interface and reported it, and it was fixed the same day. So this project was trying to verify the geolocation of RIPE Atlas probes using trace route data which goes to and from the probe. And as I said, it was done by a professional programmer, so you can follow his GitHub and check the data and code and maybe use it. This is one picture from his presentation. So, the probe is located somewhere in the United States, but the trace routes from this probe, respecting the speed of light, show the probe is actually somewhere inside these circles, so it might be very interesting for the Atlas people to develop it more. And I hope this guy will join our ‑‑ about which I want to talk later.
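The speed of light check described above can be sketched in Python: a round trip time puts an upper bound on how far the probe can be from a responding host, which gives the circles on the slide. This is an illustration, not the project's code; the function names are made up, and the factor of 2/3 for signal speed in fibre and the spherical Earth model are simplifying assumptions.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458
FIBRE_FACTOR = 2 / 3        # assumed: signals in fibre travel at about 2/3 of c
EARTH_RADIUS_KM = 6371.0    # mean radius, spherical approximation


def max_distance_km(rtt_ms):
    """Upper bound on the one-way distance implied by a round trip time."""
    one_way_seconds = (rtt_ms / 1000) / 2
    return one_way_seconds * SPEED_OF_LIGHT_KM_S * FIBRE_FACTOR


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def location_is_plausible(claimed, landmarks):
    """Check a claimed (lat, lon) against landmarks given as ((lat, lon), rtt_ms)."""
    return all(
        great_circle_km(*claimed, *position) <= max_distance_km(rtt_ms)
        for position, rtt_ms in landmarks
    )
```

For example, a 10 ms round trip allows at most about 1,000 km, so a probe registered in one continent that answers a host on another within a few milliseconds fails the check.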
About the outcome. Actually, for us it was fun and we did not spend much time ‑‑ much money on this. There is still much confusion and there are unclear things when we are talking about the Internet, especially with government people. Not many people in our region are aware of the statistical and measurement tools which are available to them, and for sure I will try to engage them more.
And we are running the third hackathon; it will be in two weeks, the weekend before the conference in Petersburg, so we are running it in Petersburg. If you are going to visit the conference you are welcome. Also, if you need other reasons to go to ENOG, you can combine it with the hackathon.
So, everyone is welcome. And you can propose tasks and become sponsor and suggest prizes. So, thank you.
(Applause)
CHAIR: Thank you very much. Any questions for Alex?
AUDIENCE SPEAKER: Hi, thank you for your presentation. I am Sergey, I have a question, how likely you consider the situation if the results, the outcome of the hackathons will be used by the governmental structure or similar to the structures, in order to build another protection level or another digital border for the Internet?
ALEXANDER ISAVNIN: Actually the tasks and one of the purposes of these hackathons were completely the opposite: it was to understand how we could oppose such regulations, and how we could try to reject or push for rejection of such legislation on protecting critical and non‑critical infrastructures.
AUDIENCE SPEAKER: Are you announcing this goal, this approach when promoting the hackathon?
ALEXANDER ISAVNIN: The first hackathon was organised together with the Internet Protection Society, which is our organisation, stating that it is in opposition to the current government and bringing the Internet as one of the points ‑‑ to one of the points of the political agenda.
AUDIENCE SPEAKER: Thank you.
ALEXANDER ISAVNIN: And I think you, as a Russian citizen, not an Internet citizen, should know I am not supporting the current Russian government regulation.
AUDIENCE SPEAKER: I am neutral.
CHAIR: Thank you very much. Thank you, Alex.
(Applause)
So we have reached the final point of today, which is any other business. And I actually have two things that I'd like to announce, and it's on behalf of the Programme Committee, I want to remind you about the elections for the Programme Committee for the plenary part of the RIPE conference, voting is ending tomorrow at 5:30 so please if you haven't already voted for your favourite candidates for the Programme Committee, go vote. On the front page of web 74 [at] ripe [dot] net. Ask Brian. At the front page ‑‑ where on the conference website. The other thing that we should remind you is that Friday morning, the day begins with a session about increased diversity at the RIPE conference and the RIPE community in general and it begins at 9:30, 9:30 Friday morning. So you get to sleep in. So, show up and participate in that session, it's super important. And then anybody has any other business they want to talk about, say? No. Coffee time.
(Applause)
LIVE CAPTIONING BY AOIFE DOWNES RPR
DOYLE COURT REPORTERS LTD, DUBLIN IRELAND.
WWW.DCR.IE