[cabfpub] ICANN Presentation on Internal Names

Ben Wilson ben at digicert.com
Sun Jul 21 15:13:47 MST 2013


Here is an excerpt from part of the ICANN presentation in Durban last week
(transcribed from ssr-17jul13-en.mp3). The actual report should be out in
about two weeks.

 

Lyman Chapin speaking:

 

This presentation provides an update on a study that my company, Interisle
Consulting Group, was commissioned to perform on the incidence and potential
consequences of name collision in the DNS, and it includes some of the topics
that we're talking about.  I want to start out by describing what we mean
when we use the term "name collision," which is not necessarily familiar to
everybody, and I apologize to those of you who understand the way the DNS
works at a fairly detailed technical level, because you will come up
with 50 reasons why this is wrong with respect to details, but bear with me.
In the world in which we live right now, we've not delegated any new TLDs
for a long time.  You can imagine that someone using a computer, either an
individual or an application, uses local names to access resources,
typically in a local environment.  You would use something like
printer.myname.  You wouldn't necessarily use a fully qualified DNS name,
and your local network resolves printer.myname and knows where to find your
printer.  That printer.myname string is like a DNS name with the ".", and so
forth.  But the local namespace is only meaningful within the context of
your local network.  That namespace is not a global namespace where the
public DNS is, and if you ask the public DNS about printer.myname, it will
tell you that the name does not exist, because "myname" is not a registered
gTLD, and so it comes back with what we call an NXDOMAIN, or non-existent
domain, response.  Now look at what the world will be like after we have
delegated a bunch of new gTLDs.  For example, should ICANN delegate myname
as a gTLD, and I deliberately chose one that is obviously not on the list of
currently proposed TLDs, then someone comes along and registers the name
"printer" at the second level in the new TLD through a certified registrar.
So now "printer.myname" is a global DNS name, and if you ask the DNS about
printer.myname, instead of going to something on your local network, you'll
get a pointer to that new registration.  That, at the highest level, is
essentially what we mean when we talk about name collision. 
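
To make the scenario concrete, here is a minimal sketch of that lookup in
Python, assuming the dnspython package; "printer.myname" is the hypothetical
internal name from the example above, and the error handling is simplified:

    import dns.resolver

    def lookup(name):
        """Ask the public DNS for an A record; None means NXDOMAIN."""
        try:
            answer = dns.resolver.resolve(name, "A")
            return [rr.address for rr in answer]
        except dns.resolver.NXDOMAIN:
            return None

    # Today "myname" is not a delegated TLD, so the public DNS answers
    # NXDOMAIN and only the local resolver can answer for printer.myname.
    # If .myname were delegated and "printer" registered under it, the
    # same query would return whatever the new registrant publishes.
    print(lookup("printer.myname"))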

 

It is like trying to interpret a name in one semantic domain or in the
context of one namespace when it properly belongs somewhere else.  So it's
really a namespace collision as opposed to a name collision issue.  So we
were asked to conduct a study to find out, first of all, how likely it is
that we were going to see this in the real world, because until now it has
been a theoretical possibility.  Everyone can imagine how it might happen,
but is it really going to happen?  And second, if we do see name collision
after we delegate new TLDs, what effects might that have on security and
stability?  And then of course, given that we might find such effects, what
options do we have to deal with them, either before or after the fact, where
the fact in this case is the delegation of the new string?  So we looked at
the best data sets that are available: the historical data sets of queries
to the root servers.  These are questions that are being presented to the
root of the DNS, basically saying, "Can you tell me information about the
following name?"  

 

We had two good large samples.  There is an exercise that was started by an
organization in 2003 that has been carried on every year since then called
"a day in the life of the Internet."   It's an exercise in which an
organization called DNS OARC captures or receives captures of packets that
have been sent to each of the root servers over a continuous 48-hour period.
Actually, a minimum 48-hour period.  Almost all of the 13 individual root
server operators participate in this exercise.  It's a very good uniform
place to go looking for things that might be happening in the query stream
to the root.  So we took the two data sets, which together comprise
somewhere between eight and 10 TB of information, and we looked for
proposed TLD names in those data sets.  We then did some additional work to
investigate some of the potential consequences, focusing on what happens
when resolution of a name is ambiguous, when it's not possible without
additional contextual information to determine how to resolve the name, and
the other issue is the one that has to do with the internal-domain-name
X.509 certificates, the public certificates that Dan was talking
about.  
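
As a rough illustration of that search, the sketch below tallies the
rightmost label of each query name; it assumes the query names have already
been extracted from the packet captures into a plain text file, one name per
line, which is a simplification of what the study actually processed:

    from collections import Counter

    def tld_counts(path):
        """Count how often each top-level label appears in a list of query names."""
        counts = Counter()
        with open(path) as f:
            for line in f:
                name = line.strip().rstrip(".").lower()
                if name:
                    counts[name.rsplit(".", 1)[-1]] += 1
        return counts

    # e.g. the 20 most frequently queried TLD strings in the sample
    print(tld_counts("queries.txt").most_common(20))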

 

The last part of our study was to investigate some of the options that might
be available, not just for ICANN, but for the community at large, to
mitigate the effects of these collisions in those cases in which the
consequences are severe.   So we looked at the query stream to the root, and
this is for 2013, and we see that there's actually some good news.   About
55% of the queries that the root servers receive are questions about actual
existing TLDs.  This is what you would expect.  In a perfect world, of
course, that would be 100%, because nobody should be asking the root about
things that don't exist.   In this slide, what we call a "proposed TLD" is
one that has been proposed as a TLD in the current round, so one of the
1,930 original names that were in the pool.  Three percent of the query
strings, and keep in mind that this is before any of these have been
delegated, consist of questions about strings that are on the proposed
list.   19% are what we call "potential TLDs": strings that are neither
existing nor proposed, but that are syntactically valid.  In other words,
they could be a TLD in the future if someone proposed them; their syntax is
correct.   Then 23% are just garbage, invalid strings that could never be a
TLD because they don't obey the syntax rules for top-level domain labels.   
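
A sketch of that four-way split, for illustration only: the study's exact
syntax test isn't spelled out in the talk, so this uses the usual
letters-digits-hyphens rule for a top-level label, and the two sets are
stand-ins for the real delegated and applied-for lists:

    import re

    EXISTING = {"com", "net", "org"}       # stand-in for the delegated root zone
    PROPOSED = {"home", "corp", "global"}  # stand-in for the ~1,930 applied-for strings

    LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")

    def classify(tld):
        tld = tld.lower()
        if tld in EXISTING:
            return "existing"
        if tld in PROPOSED:
            return "proposed"
        if LABEL.match(tld) and not tld.isdigit():
            return "potential"   # syntactically valid, could be delegated later
        return "invalid"         # breaks the syntax rules for a TLD label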

 

This is a list of the most queried TLD strings, covering existing, proposed,
and potential TLDs, so everything except the 23% that are invalid.  It's
important to point out, first of all, that these numbers are in the thousands,
so in the 2013 data set what we see for .com is roughly 8.5 billion queries--
not surprisingly .net and .org are on that list as well.    If you leave out
".local", which is a special case that got lumped in with the rest, for the
root itself the next item on the list is the string "home" with just over 1
billion queries, and if I continued this list out to the top 100 most
queried TLDs, there would be 13 of the proposed TLD strings on it.  For the
proposed TLDs, the ones that have actually been proposed in this round, this
chart shows the rank and counts for the most queried proposed TLDs, and
there is an interesting thing to note about this chart beyond the fact that
there is a fairly accurate power-law fit to the distribution: from "home",
"corp", "ice", "global" on down, it roughly
follows a power law.   Although for "home" and "corp" we have a pretty good
sense of what's causing those strings to appear in queries when they're not
actually currently delegated TLDs, it was kind of a shock to see "ice" as
number three on the list.   
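
One quick way to check that kind of power-law observation is a straight-line
fit on a log-log scale of rank against query count; the counts below are
made-up placeholders rather than the study's figures:

    import numpy as np

    counts = np.array([1_000_000, 480_000, 300_000, 220_000, 170_000, 140_000])
    ranks = np.arange(1, len(counts) + 1)

    # Fit log(count) = slope * log(rank) + intercept
    slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
    print(f"fitted exponent ~ {slope:.2f}")  # an exponent near -1 is power-law-like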

 

To give a sense of how to think about this: just occurrence, the number of
times you see a string in a query, doesn't tell you a lot about whether
that's a good thing or a bad thing or a neutral thing.  It just tells
you how often the string appears.  The additional information you need to
make any kind of a risk assessment is how serious the consequences might
be if you saw the string and it collided with a delegated TLD.  An event
that occurs very frequently but has no negative side effects is one thing;
an event that occurs very infrequently but has a really serious side
effect, like a meteor strike, is something else entirely.
It is always the product of those two factors that leads you to an
assessment of risk.  So the mere fact that a string occurs a lot looks
scary, but it is not necessarily so.  So if you go down this list, you ask
yourself, well, "home"-- we pretty much have a handle on why that is
occurring.  There are lots of routers and DSL modems and so forth configured
to use "home" in the local environment.   ".corp" - we've talked extensively
about the way in which Active Directory installations are frequently set up,
for quick setup, using .corp as the top-level domain.
".ice" turns out to be the electric utility co-op in Costa Rica, which for
some reason is blasting .ice requests out to the root, putting it in the
third position on the list.  You can imagine that the occurrence-and-
consequence product for that would be very different from what it might be
for some of the other things on this list.  This list obviously goes on for
a long time.  
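
The risk framing above reduces to simple arithmetic: risk is the product of
how often a string occurs and how severe a collision would be.  The
normalization and the consequence weights in this sketch are invented for
illustration and are not values from the study:

    def risk_score(query_count, max_count, consequence_weight):
        """Occurrence (relative to the busiest string) times consequence, both 0..1."""
        likelihood = query_count / max_count
        return likelihood * consequence_weight

    # A very frequent string with benign consequences can still score lower
    # than a somewhat rarer string whose collision would break something important.
    print(risk_score(1_000_000_000, 8_500_000_000, 0.1))  # frequent, mild
    print(risk_score(500_000_000, 8_500_000_000, 0.9))    # rarer, severe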

 

In 2013 there were only 14 of the proposed TLDs that never appeared, so this
distribution has a very large head and a very long tail, and there are lots
of things out in the negligible-occurrence tail, but there are only 14 of
those strings currently proposed that never occurred.  If we look at larger
data sets-- and we have had some informal discussions with individual root
server operators-- it is almost certain that every single string being
applied for will be found somewhere in the query stream to the root.  And if
you go down the list, you also see some
entries on here like down at number 14 for 2013 "HSBC," which is highly
likely to be restricted to that particular bank.  However, if you look at
rank number 10, you see Cisco, and you might think that must be the Cisco
Corporation, but in fact, it's most likely an artifact of the way in which
people tend to name their routers-- router1.Cisco.name.com or something like
that.  So with that kind of triage that you could imagine doing, looking at
the strings and making informed guesses about what they might mean in
terms of origin, you have to be a little bit careful.   You can see that in
some cases, if you saw IBM on this list, which is down below 15, but it's on
the list, you might assume that that was the IBM Corporation, but you would
want to follow that up with some investigation for sure.   

 

If you look at the potential consequences-- and before I go through what
looks like a very scary list of bad things that might happen, I want to
point out once again that we need to look at both the potential
consequences and the likelihood that they might occur before we make any
judgment about what the delegation risk might be.   The most obvious
potential consequence right off the bat is that it is likely to change the
way in which a local namespace is resolved.  So if there is a collision, you
might find yourself accustomed to a particular behavior on your local
network that might change if the label that you were using as the top-level
domain in your local network were suddenly to start resolving in the public
DNS after it is delegated as a new TLD label.  Search list processing is
likely to change as well.  A search list is a list maintained by your
operating system or by a piece of application software, which appends
suffixes, in order, to the string that you might enter at the user interface
to try to create a fully qualified domain name to resolve.   So it might try
the name with example.com, then with corporationname.com, and try lots of
different suffixes until it comes up with a fully qualified domain name that
actually resolves.  Obviously, because today a lot of these strings are not
delegated but will be tomorrow, the way in which the search list causes
resolution behavior to happen in your local environment could easily change.
One of the consequences of that is the possibility that various kinds of
application streams and packets could get misdirected.  
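
A sketch of search-list expansion, to show why this matters: real stub
resolvers apply extra rules (ndots thresholds, trailing dots, and so on),
and the suffixes here are hypothetical, but the idea is just which candidate
names get tried and in what order:

    SEARCH_LIST = ["corp.example.com", "example.com", ""]  # "" means try the name as-is

    def candidates(name):
        """Fully qualified names a simple stub resolver might try, in order."""
        return [f"{name}.{suffix}" if suffix else name for suffix in SEARCH_LIST]

    # Before delegation, a bare label like "intranet" only resolves via a suffix;
    # once the corresponding string exists in the public DNS, the as-is candidate
    # can start resolving too, silently changing the observed behavior.
    print(candidates("intranet"))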

 

When you look at the databases of packet streams to the root, we see not
only what I call standard requests for "A" (address) records; we also see
requests for MX (mail exchange) records and SRV (service) records.  Mail
exchange records are typically used by mail processing systems, obviously.
SRV records are most often used by SIP, the protocol used for voice over IP.
The fact that we see these suggests that there are systems that are
configured to look for that information from a local root, and those queries
are escaping onto the public Internet.  So when you start to resolve those
globally, it is possible that voice over IP calls could get misdirected.
Again, I'll just add that this is not something that is going to be a
wholesale problem.  We are not going to turn on a new TLD and suddenly see
email across the entire Internet going to the wrong place.  These are things
that could happen, and it's important to point out that determining whether
or not they would actually happen in practice would require more
investigation along the lines of what we conducted in this study.  The
public key certificate issue is the one we discussed when Dan gave his
presentation, and it's also covered in SSAC Report 57.   
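
For the SRV case, a hedged sketch along the same lines as before, again
assuming dnspython; the service name follows the usual _sip._udp convention,
and "voice.corp" is an invented internal suffix rather than anything from
the study:

    import dns.resolver

    def sip_targets(domain):
        """Return (priority, target, port) tuples from an SRV lookup, or [] if none."""
        try:
            answer = dns.resolver.resolve(f"_sip._udp.{domain}", "SRV")
            return [(rr.priority, str(rr.target), rr.port) for rr in answer]
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return []

    # Today the public root returns NXDOMAIN for an undelegated suffix; after
    # delegation, the same query could return someone else's SIP servers.
    print(sip_targets("voice.corp"))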

 

The final one on this list, which we spent some time on in the study, and
there are probably other consequences that we don't know about, is the way
in which web browser cookie data are stored and scoped to the fully
qualified domain name.   It may be possible under certain circumstances for
cookie-store information on your machine to be accepted in a different
environment, where the name resolves differently, in such a way as to expose
your cookie data, which would enable someone else to find out your identity
information and essentially masquerade as you in certain situations.  Again,
this is not something that is going to happen to all of us as soon as
somebody turns on the new TLDs.  
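
A minimal sketch of why that cookie scoping matters: cookies are tied to a
domain via the Domain attribute, so a cookie set for an internal name could
be sent to whoever answers for that name after delegation.  The names and
values here are made up:

    from http.cookies import SimpleCookie

    cookie = SimpleCookie()
    cookie["session"] = "secret-token"
    cookie["session"]["domain"] = ".intranet.corp"  # internal-only today
    cookie["session"]["secure"] = True

    # The browser attaches this cookie to any host matching the domain; if
    # ".corp" starts resolving publicly, a matching host may no longer be
    # the machine the user thinks it is.
    print(cookie.output())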

 

 

The options that we came up with for potentially resolving some of these
things may apply only to a very small number of strings, or they might apply
to the entire list of strings as a whole, depending on how we decide
to deal with some of these issues.  The obvious one is to permanently
reserve a string that you've decided is both likely to occur in a collision
and is likely to occur with consequences that are very serious.  If both of
those variables have very high values, then the product, which is the
measure of risk, would be very high as well.  There have been some
suggestions over the past few years, particularly within the IETF, to in
fact permanently reserve some of the strings to prevent the name collisions
we've been talking about from happening.   That's a pretty radical step to
take, and again, these are options-- not necessarily things that anyone is
going to do, and in particular, we don't, as a result of our study,
explicitly recommend that ICANN or anyone else follow one of these options.


 

These represent choices that can be debated within the community.  Another
obvious option is to study the impact, either of an individual string or of
name collision in general, more than we were able to do in this study: to
get a better data set and do more investigation into what the consequences
of name collision might be in specific cases, by going out into the world
and asking people who may have experience with some of these things.  There
are a lot of avenues you can imagine exploring, a lot more thoroughly than
we did in this study.  Obviously that would delay delegation and you would
have to
have some kind of termination condition because whenever you suggest that
further study be done before you take action you have to determine how much
study is going to be enough.  When will I know that I've studied the problem
enough to be confident that I can now make a decision? 

 

A third option, which we called wait-until-everyone's-left-the-room, or
wait-til-everyone's-gone, and which is very similar to what the CAB Forum is
suggesting with respect to the strings that are most commonly used in
internal name certificates, is to delay delegation of the string until the
colliding use has stopped.  In that case, you would wait until all of the
internal name certificates that had been issued, with that string in the
subject name or subject alternative name field, had either expired or been
revoked.  That same approach could be adopted for other strings if the uses
are such that they could readily be changed.  So for instance, if the
electric utility in Costa Rica could be convinced to close whatever hole is
causing the .ice queries to escape out into the public Internet as a query
to the public DNS root, that would presumably solve the problem with name
collision for that string.  So there are many ways in which you can wait
until everyone's gone.  It doesn't necessarily mean waiting for 10 years
until you're certain that every possible use of the string has died out.  In
cases
where you can identify specifically the reasons why a string is used, you
may have options that don't require that lengthy delay.  
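
For the certificate variant of wait-until-everyone's-gone, a sketch of the
bookkeeping involved, assuming the pyca/cryptography package: it checks
whether any dNSName in a certificate's subject alternative name extension
falls under a given string, and reports the expiry date; revocation checking
is left out:

    from cryptography import x509

    def uses_internal_tld(pem_bytes, tld):
        """Return (uses_tld, not_valid_after) for a PEM-encoded certificate."""
        cert = x509.load_pem_x509_certificate(pem_bytes)
        try:
            san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
        except x509.ExtensionNotFound:
            return False, cert.not_valid_after
        names = [n.lower().rstrip(".") for n in san.value.get_values_for_type(x509.DNSName)]
        hit = any(n == tld or n.endswith("." + tld) for n in names)
        return hit, cert.not_valid_after  # expiry bounds how long the colliding use lasts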

 

The fourth option, which has been discussed and in particular proposed in
the letter from VeriSign to ICANN execs, is to do what VeriSign calls
ephemeral delegation and what we called a trial run.   That's to delegate
the name in such a way that it isn't being operated by the eventual registry
operator or applicant.  It is instead operated by some trusted third party
that carefully deploys it, establishing the kind of monitoring that will be
necessary to determine if something bad happens, and, if something bad does
happen, withdraws it quickly.  Obviously you would advertise those names
with a short time-to-live (TTL) so that if some negative effects were
observed you could withdraw the name quickly and the
consequences would hopefully be limited.   There are some obvious pros and
cons to that, and I won't go into a lot of the details, because the devil is
definitely in the details with that one, but that is certainly an option.
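
The short-TTL point can be seen directly with dnspython: the TTL carried by
the authoritative answer is what bounds how quickly a withdrawal propagates.
"com." below is just a convenient existing string to query; a trial
delegation would publish its records with a much smaller TTL:

    import dns.resolver

    answer = dns.resolver.resolve("com.", "NS")
    print("TTL seen by resolvers:", answer.rrset.ttl, "seconds")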

Comment:   You know, on your first option, you could apply it not to the
whole TLD, but reserve just a subdomain, and those queries would not be
impacted by a collision. 

 

You would have to be sure that whoever was operating the TLD would never
delegate any further.  So, I'm happy to take questions here.

 

A question from Kevin Murphy of Domain Incite in London:  If the study
didn't look at the sources of the requests for the proposed TLDs, is there a
way to figure out how many users or networks might be affected if these TLDs
are delegated?  

 

We did have an opportunity to collect data on the sources.  We collected
data by IP address prefix, so we do have a sense of where the queries were
coming from.  What we did in the study report is look not so much
specifically at where the queries came from, but, for each of the TLDs that
we looked at, at how many different sources we received queries from.  The
interesting thing is really how widely distributed the query stream for a
particular string is.  If it is coming from one very particular place, as it
is in the case of the .ice string that I've been talking about, then you can
be pretty confident that this is where this string is escaping from.  At
least from a topological standpoint, as measured by IP address prefix, it's
limited to a relatively small area.  That suggests it might be easier to
figure out a way to stop that conflicting use than it would be for a string
that appears from many different IP addresses.  So yes, those data are in
the report, but I don't have them in the presentation.   
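
A sketch of that source-spread measurement: group query sources into address
prefixes and count distinct prefixes per TLD string.  The (tld, source_ip)
input format and the /24 grouping are assumptions made for illustration:

    from collections import defaultdict
    import ipaddress

    def prefix_spread(records, prefix_len=24):
        """Count distinct source prefixes seen per TLD string."""
        seen = defaultdict(set)
        for tld, src_ip in records:
            net = ipaddress.ip_network(f"{src_ip}/{prefix_len}", strict=False)
            seen[tld.lower()].add(net)
        return {tld: len(nets) for tld, nets in seen.items()}

    # A string seen from one or two prefixes (like .ice) is easier to chase down
    # than one arriving from thousands of unrelated networks.
    print(prefix_spread([("ice", "203.0.113.7"), ("ice", "203.0.113.9"),
                         ("home", "198.51.100.1")]))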

 

Other questions and responses followed.

 

 

 
