What Is URL (Uniform Resource Locator) Encoding?
Definition
URL encoding is an encoding format used in URLs. It allows arbitrary data to be placed inside a Uniform Resource Identifier (a URI; typically a URL) while using only a narrow set of US-ASCII characters. The encoding exists because URLs and HTTP request parameters often contain characters (or other data) that fall outside this limited set (e.g. control characters or non-ASCII characters).
Reserved and unreserved characters
In general, a URI can contain characters that are either reserved or unreserved. Unreserved characters are characters that have no special meaning; they can be displayed as-is and require no special handling. These include uppercase and lowercase letters (A-Z, a-z), decimal digits (0-9), hyphen (-), period (.), underscore (_), and tilde (~).
Reserved characters, on the other hand, are characters that may delimit the URI into sub-components: characters such as / # & and others. The following is the list of all reserved characters: ! # $ & ' ( ) * + , / : ; = ? @ [ ].
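The two character sets above can be written down as a small classifier. This is only a sketch: `classifyChar` is a hypothetical helper name, it expects a single character, and anything outside the two sets (space, %, non-ASCII characters) is simply reported as "other".

```javascript
// The two sets exactly as listed in the text above.
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;
const RESERVED = /^[!#$&'()*+,\/:;=?@\[\]]$/;

// classifyChar is a hypothetical helper used only for this illustration.
function classifyChar(ch) {
  if (UNRESERVED.test(ch)) return "unreserved";
  if (RESERVED.test(ch)) return "reserved";
  return "other"; // e.g. space, %, control or non-ASCII characters
}

console.log(classifyChar("a")); // unreserved
console.log(classifyChar("#")); // reserved
console.log(classifyChar(" ")); // other
```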
We cannot use a reserved character as-is, because this would create ambiguous URIs. For instance, consider the URL http://example.com/foo#bar. Does this URL point to an anchor #bar inside resource /foo, or does it point to a resource /foo#bar, that is, a resource whose name contains the character #? Without URL encoding it would be impossible to tell.
We resolve such ambiguities by encoding reserved characters when they are used as data; when they are used as delimiters, we leave them as-is.
Percent encoding
To encode reserved characters, we use the percent-encoding scheme. In percent-encoding, each byte is encoded as a character triplet that consists of the percent character % followed by the two hexadecimal digits that represent the byte's numeric value. For instance, %23 is the percent-encoding for the binary octet 00100011, which in US-ASCII corresponds to the character #. Strictly speaking, while the percent character (%) isn't reserved, it nonetheless serves as a special indicator for percent-encoded bytes (and therefore requires special handling). Simply put: it must also be percent-encoded (as %25).
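To make the triplet construction concrete, here is a minimal sketch that builds it by hand and compares it with JavaScript's built-in `encodeURIComponent`. The helper name `percentEncodeByte` is an assumption made for this illustration, not part of any standard API.

```javascript
// percentEncodeByte is a hypothetical helper: it turns one byte value (0-255)
// into "%" followed by two uppercase hexadecimal digits.
function percentEncodeByte(byte) {
  return "%" + byte.toString(16).toUpperCase().padStart(2, "0");
}

const hashCode = "#".charCodeAt(0); // 35 decimal
console.log(hashCode.toString(2).padStart(8, "0")); // 00100011
console.log(percentEncodeByte(hashCode)); // %23
console.log(percentEncodeByte("%".charCodeAt(0))); // %25

// The built-in function produces the same triplets.
console.log(encodeURIComponent("#")); // %23
console.log(encodeURIComponent("%")); // %25
```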
So with percent-encoding, we know that the URL http://example.com/foo#bar points to an anchor bar inside resource /foo, while http://example.com/foo%23bar points to resource /foo#bar, where the character # is encoded as %23.
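A URL parser draws the same distinction. The sketch below uses the standard `URL` class, available in browsers and in Node.js; the variable names are chosen only for this illustration.

```javascript
// An unencoded "#" acts as a delimiter and starts the fragment.
const withFragment = new URL("http://example.com/foo#bar");
console.log(withFragment.pathname); // /foo
console.log(withFragment.hash);     // #bar

// An encoded "#" is plain data and stays inside the path.
const withEncodedHash = new URL("http://example.com/foo%23bar");
console.log(withEncodedHash.pathname); // /foo%23bar
console.log(withEncodedHash.hash === ""); // true

// Decoding the path recovers the resource name.
console.log(decodeURIComponent(withEncodedHash.pathname)); // /foo#bar
```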
Other characters
Percent encoding is also used to represent other characters: characters that are neither reserved nor unreserved. As an example, imagine a GET request containing a non-ASCII string parameter, such as a search query zajec in jež, which is Slovenian for "a rabbit and a hedgehog".
In such cases, we first have to encode the non-ASCII characters as UTF-8 and then encode each byte of the resulting string with percent-encoding. So if we send a GET request containing the search query zajec in jež to the DuckDuckGo search engine, we generate the following URL: https://duckduckgo.com/?q=zajec%20in%20je%C5%BE
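The two steps (UTF-8 first, then percent-encoding byte by byte) can be sketched as follows. `percentEncodeUtf8` is a hypothetical helper written for this illustration; it leaves only the unreserved characters untouched and encodes everything else. The letter ž is written as the escape `\u017E` so that the example does not depend on how the source file is saved.

```javascript
// percentEncodeUtf8 is a hypothetical helper, not a standard function.
function percentEncodeUtf8(text) {
  const utf8 = new TextEncoder(); // TextEncoder always produces UTF-8 bytes
  let result = "";
  for (const ch of text) {
    if (/^[A-Za-z0-9\-._~]$/.test(ch)) {
      result += ch; // unreserved characters are kept as-is
    } else {
      // Step 1: convert the character to UTF-8 bytes.
      // Step 2: percent-encode each byte.
      for (const byte of utf8.encode(ch)) {
        result += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
      }
    }
  }
  return result;
}

const query = "zajec in je\u017E"; // "zajec in jež"
console.log(percentEncodeUtf8(query));  // zajec%20in%20je%C5%BE
console.log(encodeURIComponent(query)); // zajec%20in%20je%C5%BE
```

The built-in `encodeURIComponent` gives the same result for this query. It differs from the sketch only in that it also leaves the characters ! ' ( ) * unencoded.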
Encoding the space character
You may have seen cases where the space character was encoded as the character +. However, percent-encoding suggests it should be encoded as %20 (in US-ASCII, the space character is 20 hexadecimal or 32 decimal). So what is going on?
Such encodings are typically created by HTML forms. When a user submits an HTML form, the data is URL-encoded using an early version of the URI percent-encoding rules that contained a number of modifications, such as replacing spaces with +.
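The difference between the two encodings is easy to observe in JavaScript. As a small sketch: the standard `URLSearchParams` class serializes data the way a form submission does, while `encodeURIComponent` applies plain percent-encoding. The parameter name and value are illustrative.

```javascript
// Form-style serialization: spaces become "+".
const formData = new URLSearchParams({ "search query": "search term" });
console.log(formData.toString()); // search+query=search+term

// Plain percent-encoding: spaces become "%20".
console.log(encodeURIComponent("search term")); // search%20term
```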
Note, however, that using + instead of %20 is valid only when encoding application/x-www-form-urlencoded content, such as the query part of a URL. To make this clearer, consider the following cases.
http://www.example.com/search+script.php?search+query=search+term
In this URL, the resource being requested is search+script.php (the plus character (+) is part of the filename), while the parameter name is search query and its value is search term. In the name of the query parameter and in its value, the + sign is converted to a space, while in the name of the resource, search+script.php, the + sign remains.
http://www.example.com/search+script.php?search%20query=search%20term
This case is identical to the example above. The difference (using %20 instead of the + sign in the parameter name and value) is only superficial. Both URLs point to the same resource, search+script.php, and they contain the same parameters.
http://www.example.com/search%20script.php?search%20query=search%20term
This example, however, is different. Here the resource name contains an actual space character, so the name of the requested resource is search script.php; the request parameter names and values remain the same as above. Consequently, this URL is different from those above.
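The three cases can be checked with the standard `URL` class. In this sketch, `describe` is a hypothetical helper that decodes the path with `decodeURIComponent` (which leaves + untouched) and reads the first query parameter through `searchParams` (which applies the form rules and turns + into a space).

```javascript
const urls = [
  "http://www.example.com/search+script.php?search+query=search+term",
  "http://www.example.com/search+script.php?search%20query=search%20term",
  "http://www.example.com/search%20script.php?search%20query=search%20term",
];

// describe is a hypothetical helper used only for this illustration.
function describe(urlString) {
  const url = new URL(urlString);
  // Path: plain percent-decoding, so "+" stays a literal plus sign.
  const resource = decodeURIComponent(url.pathname);
  // Query: form decoding, so both "+" and "%20" become a space.
  const [[name, value]] = url.searchParams; // first parameter only
  return { resource, name, value };
}

for (const u of urls) {
  console.log(describe(u));
}
// 1st and 2nd: resource "/search+script.php", name "search query", value "search term"
// 3rd:         resource "/search script.php", name "search query", value "search term"
```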
A URL encoder
The application below performs URL encoding and decoding on arbitrary strings. Feel free to test it out (HTML).
Input <br>
<input type="text" name="input" id="input"><br><br>
Output <br>
<input type="text" name="encoded" id="encoded">
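<script>
  // References to the two text fields, assigned once the DOM is ready.
  let input = null;
  let encoded = null;
  document.addEventListener("DOMContentLoaded", () => {
    input = document.querySelector("#input");
    input.onkeyup = encode;
    encoded = document.querySelector("#encoded");
    encoded.onkeyup = decode;
  });
  // Percent-encode the plain text and show it in the output field.
  function encode(event) {
    encoded.value = encodeURIComponent(input.value);
  }
  // Decode the output field back to plain text. decodeURIComponent throws
  // a URIError on malformed input, such as a lone "%".
  function decode(event) {
    try {
      input.value = decodeURIComponent(encoded.value);
    } catch (error) {
      input.value = "Invalid URI string";
    }
  }
</script>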