CERT Advisory CA-2000-02 describes a problem with malicious tags embedded in client HTTP requests, discusses the impact of malicious scripts, and offers ways to prevent the insertion of malicious tags.
This tech tip, written for web developers, describes more specifically the steps you can take to prevent attackers from using untrusted content to exploit your web site.
Web pages contain both text and HTML markup that is generated by the server and interpreted by the client browser. Servers that generate static pages have full control over how the client will interpret the pages sent by the server. However, servers that generate dynamic pages do not have complete control over how their output is interpreted by the client. The heart of the issue is that if untrusted content can be introduced into a dynamic page, neither the server nor the client has enough information to recognize that this has happened and take protective actions.
In HTML, to distinguish text from markup, some characters are treated specially. The grammar of HTML determines the significance of "special" characters -- different characters are special at different points in the document. For example, the less-than sign "<" typically indicates the beginning of an HTML tag. Tags can either affect the formatting of the page or introduce a program that the browser executes (e.g., the <SCRIPT> tag introduces code from a variety of scripting languages).
Many web servers generate web pages dynamically. For example, a search engine may perform a database search and then construct a web page that contains the result of the search. Any server that creates web pages by inserting dynamic data into a template should check to make sure that the data to be inserted does not contain any special characters (e.g., "<"). If the inserted data contains special characters, the user's web browser will mistake them for HTML markup. Because HTML markup can introduce programs, the browser could interpret some data values as HTML tags or script rather than displaying them as text.
If a web server does not check for special characters in dynamically generated web pages, an attacker can in some cases choose the data that the server inserts into the generated page. The attacker can then trick the user's browser into running a program of the attacker's choice. This program executes in the browser's security context for communicating with the legitimate web server, not in the security context for communicating with the attacker. Thus, the program runs in an inappropriate security context with inappropriate privileges.
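To make the risk concrete, here is a minimal JavaScript sketch. The function `buildResultsPage` and the sample search terms are hypothetical and are not taken from any real site; the sketch only shows how a template that inserts a request value unchecked ends up containing markup chosen by someone else.

```javascript
// Hypothetical page builder: inserts the search term into a template
// without checking it for special characters.
function buildResultsPage(searchTerm) {
  return "<HTML><BODY><P>Results for: " + searchTerm + "</P></BODY></HTML>";
}

// A harmless value is displayed as text by the browser.
const benignPage = buildResultsPage("web security");

// A value chosen by an attacker, for example delivered through a crafted link.
const hostilePage = buildResultsPage("<SCRIPT>alert('injected')</SCRIPT>");

// The generated page now contains a SCRIPT element that the server never wrote.
const pageContainsScript = /<SCRIPT>/i.test(hostilePage);
console.log(pageContainsScript); // true
```

Because the page is delivered by the legitimate server, the browser has no way to tell the inserted SCRIPT element apart from markup the site's author intended.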
Any data inserted into an output stream originating from a server is presented as originating from that server, even if it does not include malicious tags. Web developers must evaluate whether their sites will send untrusted data as part of an output stream.
Untrusted input can come from, but is not limited to,
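A combination of steps must be taken to mitigate this vulnerability. These steps include explicitly setting the character encoding for each page the web server generates, identifying special characters, encoding dynamic output elements, filtering specific characters in dynamic elements, and examining cookies.
The following sections discuss details of each of these steps.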
Many web pages leave the character encoding ("charset" parameter in HTTP) undefined. In earlier versions of HTML and HTTP, the character encoding was supposed to default to ISO-8859-1 if it wasn't defined. In fact, many browsers had a different default, so it was not possible to rely on the default being ISO-8859-1. HTML version 4 legitimizes this: if the character encoding isn't specified, any character encoding can be used.
If the web server doesn't specify which character encoding is in use, it can't tell which characters are special. Web pages with unspecified character encoding work most of the time because most character sets assign the same characters to byte values below 128. But which of the values above 128 are special? Some 16-bit character-encoding schemes have additional multi-byte representations for special characters such as "<". Some browsers recognize this alternative encoding and act on it. This is "correct" behavior, but it makes attacks using malicious scripts much harder to prevent. The server simply doesn't know which byte sequences represent the special characters.
For example, UTF-7 provides alternative encoding for "<" and ">", and several popular browsers recognize these as the start and end of a tag. This is not a bug in those browsers. If the character encoding really is UTF-7, then this is correct behavior. The problem is that it is possible to get into a situation in which the browser and the server disagree on the encoding. Web servers should set the character set, then make sure that the data they insert is free from byte sequences that are special in the specified encoding. For example:
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<TITLE>HTML SAMPLE</TITLE>
</HEAD>
<BODY>
<P>This is a sample HTML page
</BODY>
</HTML>
The META tag in the HEAD section of this sample HTML forces the page to use the ISO-8859-1 character set encoding.
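The encoding can also be named in the "charset" parameter of the HTTP Content-Type header. The following JavaScript is a sketch only: `htmlHeaders` is a hypothetical helper, and how response headers are actually set depends on your server software. The second half shows why the declaration matters, using the UTF-7 forms of "<" and ">" ("+ADw-" and "+AD4-").

```javascript
// Hypothetical helper: build response headers that always name the
// character encoding, so the browser does not have to guess it.
function htmlHeaders(charset) {
  // Reject anything that does not look like a plain charset name.
  if (!/^[A-Za-z0-9._:-]+$/.test(charset)) {
    throw new Error("unexpected charset name: " + charset);
  }
  return { "Content-Type": "text/html; charset=" + charset };
}

const headers = htmlHeaders("ISO-8859-1");
console.log(headers["Content-Type"]); // text/html; charset=ISO-8859-1

// In UTF-7, "+ADw-" and "+AD4-" are alternative encodings of "<" and ">".
// A filter that only looks for the literal characters leaves them untouched.
const utf7Input = "+ADw-SCRIPT+AD4-";
const filtered = utf7Input.replace(/[<>]/g, "");
console.log(filtered === utf7Input); // true: nothing was removed
```

If the page declares ISO-8859-1, a browser that honors the declaration treats "+ADw-" as ordinary text rather than as the start of a tag.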
The next two steps, encoding and filtering, first require an understanding of "special characters". The HTML specification determines which characters are "special", because they have an effect on how the page is displayed. However, many web browsers try to correct common errors in HTML. As a result, they sometimes treat characters as special when, according to the specification, they aren't. In addition, the set of special characters depends on the context:
It is important to note that individual situations may warrant including additional characters in the list of special characters. Web developers must examine their applications and determine which characters can affect their web applications.
Each character in the ISO-8859-1 specification can be encoded using its numeric entity value. A complete description of the ISO-8859-1 specification can be found in the appendix of this document. The following example uses the copyright mark in an HTML document:
<p>&#169; 2000 Some Co., Inc.
The copyright character has the numeric value 169. The &# syntax lets the author insert an encoded character that the browser interprets and displays.
In addition, many of the ISO-8859-1 characters have an entity name encoding. The copyright mark can also be written using this method:
<p>&copy; 2000 Some Co., Inc.
Encoding untrusted data has benefits over filtering untrusted data, including the preservation of visual appearance in the browser. This is important when special characters are considered acceptable.
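As an illustration, the following JavaScript sketch encodes a string using the &# syntax. The name `encodeForHtml` is hypothetical, and the decision to encode every character other than ASCII letters, digits, and spaces is an assumption made for this example, not a recommendation from the advisory.

```javascript
// Hypothetical helper: replace every character outside a small safe set
// with its numeric character reference (&#NNN;).
function encodeForHtml(untrusted) {
  return String(untrusted).replace(/[^A-Za-z0-9 ]/gu, function (ch) {
    return "&#" + ch.codePointAt(0) + ";";
  });
}

console.log(encodeForHtml("<SCRIPT>alert(1)</SCRIPT>"));
// &#60;SCRIPT&#62;alert&#40;1&#41;&#60;&#47;SCRIPT&#62;

console.log(encodeForHtml("\u00A9 2000"));
// &#169; 2000
```

The browser displays the encoded string exactly as the user typed it, but none of its characters can start a tag or an entity.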
Unfortunately, encoding all untrusted data can be resource intensive. Web developers must select a balance between encoding and the other option of data filtering.
Unfortunately, it is unclear whether there are any other characters or character combinations that can be used to expose other vulnerabilities. The recommended method is to select the set of characters that is known to be safe rather than excluding the set of characters that might be bad. For example, a form element that is expecting a person's age can be limited to the set of digits 0 through 9. There is no reason for this age element to accept any letters or other special characters. Using this positive approach of selecting the characters that are acceptable will help to reduce the ability to exploit other yet unknown vulnerabilities.
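The age example can be sketched in JavaScript as follows. Both function names are hypothetical, and the three-digit limit in `isValidAge` is an assumption added for illustration.

```javascript
// Hypothetical helper: remove every character that is not a digit.
function keepDigits(input) {
  return String(input).replace(/[^0-9]/g, "");
}

// Hypothetical validator: accept the field only if it is one to three digits.
function isValidAge(input) {
  return /^[0-9]{1,3}$/.test(String(input));
}

console.log(keepDigits("42<SCRIPT>")); // 42
console.log(isValidAge("42"));         // true
console.log(isValidAge("42<SCRIPT>")); // false
```

Stripping keeps whatever safe characters remain, while validating rejects the whole value; which behavior suits a given field is a design decision for the application.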
The filtering process can be done as part of the data input process, the data output process, or both. Filtering the data during the output process, just before it is rendered as part of the dynamic page, is recommended. Done correctly, this approach ensures that all dynamic content is filtered. Filtering on the input side is less effective because dynamic content can be entered into a web site's database(s) via methods other than HTTP. In this case, the web server may never see the data as part of the input process. Unless the filtering is implemented in all places where dynamic data is entered, the data elements may still remain tainted.
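A short JavaScript sketch of output-side filtering follows. The record layout, the source labels, and the function names are hypothetical; the set of removed characters is only an example and should be chosen per application.

```javascript
// Characters removed at output time (an illustrative set).
const BAD_CHARS = /[<>"'%;()&+]/g;

function filterForOutput(value) {
  return String(value).replace(BAD_CHARS, "");
}

// Hypothetical records: one arrived through an HTTP form, the other was
// loaded into the database by a batch job that the web server never saw.
const records = [
  { source: "http-form", comment: "Nice article" },
  { source: "batch-import", comment: "<SCRIPT>alert(1)</SCRIPT>" },
];

// Filtering just before the value is written into the page covers both
// records, regardless of how they entered the database.
function renderComments(rows) {
  return rows
    .map(function (row) { return "<P>" + filterForOutput(row.comment) + "</P>"; })
    .join("\n");
}

const html = renderComments(records);
console.log(html);
// <P>Nice article</P>
// <P>SCRIPTalert1/SCRIPT</P>
```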
One method to exploit this vulnerability involves inserting malicious content into a cookie. Web developers should carefully examine the cookies that they accept and use the filtering techniques described above to verify that they are not storing malicious content.
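One way to examine cookies is sketched below in JavaScript. The helper names are hypothetical, the parser is deliberately simplified, and the allowed character set (letters, digits, "-" and "_") is an assumption that a real application would adjust to the values it actually issues.

```javascript
// Hypothetical helper: split a Cookie request header into name/value pairs.
function parseCookieHeader(header) {
  const cookies = new Map();
  String(header || "").split(";").forEach(function (pair) {
    const idx = pair.indexOf("=");
    if (idx < 0) return; // skip fragments without a name=value shape
    const name = pair.slice(0, idx).trim();
    const value = pair.slice(idx + 1).trim();
    if (name) cookies.set(name, value);
  });
  return cookies;
}

// Accept a cookie value only if every character is in the known-safe set.
function isAcceptableCookieValue(value) {
  return /^[A-Za-z0-9_-]*$/.test(value);
}

const cookies = parseCookieHeader("session=abc123; theme=<SCRIPT>alert(1)</SCRIPT>");
console.log(isAcceptableCookieValue(cookies.get("session"))); // true
console.log(isAcceptableCookieValue(cookies.get("theme")));   // false
```

A value that fails the check can be discarded or replaced with a default before it is stored or echoed into a page.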
#include <windows.h>   /* BYTE, DWORD */

/* Lookup table indexed by byte value; 0xFF marks a character to remove.
   Flagged characters: " # & ' ( ) ; < > */
BYTE IsBadChar[] = {
    /* 0x00 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0x10 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0x20 */ 0x00,0x00,0xFF,0xFF,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0x30 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xFF,0xFF,0x00,0xFF,0x00,
    /* 0x40 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0x50 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0x60 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0x70 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0x80 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0x90 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0xA0 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0xB0 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0xC0 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0xD0 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0xE0 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    /* 0xF0 */ 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00
};

/* Removes flagged characters from pString in place and returns the new length.
   cChLen is the number of bytes to examine; scanning also stops at a NUL byte. */
DWORD FilterBuffer(BYTE * pString, DWORD cChLen)
{
    BYTE * pBad = pString;
    BYTE * pGood = pString;
    DWORD i = 0;

    if (!pString) return 0;

    /* The original loop ignored cChLen; bound the scan by it. */
    for (i = 0; i < cChLen && pBad[i]; i++) {
        if (!IsBadChar[pBad[i]]) *pGood++ = pBad[i];
    }

    /* Terminate the shortened string when there is room inside the cChLen bytes. */
    if ((DWORD)(pGood - pString) < cChLen) *pGood = 0;

    return (DWORD)(pGood - pString);
}
function RemoveBad(InStr) {
    // Remove each of the characters < > " ' % ; ( ) & +
    InStr = InStr.replace(/[<>"'%;()&+]/g, "");
    return InStr;
}
# The first function takes the negative approach.
# Use a list of bad characters to filter the data
sub FilterNeg {
    my ( $fd ) = @_;
    $fd =~ s/[\<\>\"\'\%\;\)\(\&\+]//g;
    return( $fd );
}

# The second function takes the positive approach.
# Use a list of good characters to filter the data
sub FilterPos {
    my ( $fd ) = @_;
    $fd =~ tr/A-Za-z0-9\ //dc;
    return( $fd );
}

$Data = "This is a test string<script>";
$Data = &FilterNeg( $Data );
print "$Data\n";

$Data = "This is a test string<script>";
$Data = &FilterPos( $Data );
print "$Data\n";
Number | Name | Description | Appearance
&#00;-&#08; | - | Unused | -
&#09; | - | Horizontal tab | space
&#10; | - | Line feed | space
&#11;-&#31; | - | Unused | -
&#32; | - | Space | space
&#33; | - | Exclamation mark | !
&#34; | &quot; | Quotation mark | "
&#35; | - | Number sign | #
&#36; | - | Dollar sign | $
&#37; | - | Percent sign | %
&#38; | &amp; | Ampersand | &
&#39; | - | Apostrophe | '
&#40; | - | Left parenthesis | (
&#41; | - | Right parenthesis | )
&#42; | - | Asterisk | *
&#43; | - | Plus sign | +
&#44; | - | Comma | ,
&#45; | - | Hyphen | -
&#46; | - | Period (full stop) | .
&#47; | - | Solidus (slash) | /
&#48;-&#57; | - | Digits (0-9) | 0-9
&#58; | - | Colon | :
&#59; | - | Semicolon | ;
&#60; | &lt; | Less than | <
&#61; | - | Equals sign | =
&#62; | &gt; | Greater than | >
&#63; | - | Question mark | ?
&#64; | - | Commercial at | @
&#65;-&#90; | - | Uppercase A-Z | A-Z
&#91; | - | Left square bracket | [
&#92; | - | Reverse solidus (backslash) | \
&#93; | - | Right square bracket | ]
&#94; | - | Caret | ^
&#95; | - | Horizontal bar (underscore) | _
&#96; | - | Grave accent | `
&#97;-&#122; | - | Lowercase a-z | a-z
&#123; | - | Left curly brace | {
&#124; | - | Vertical bar | |
&#125; | - | Right curly brace | }
&#126; | - | Tilde | ~
&#127;-&#159; | - | Unused | -
&#160; | &nbsp; | Non-breaking space |
&#161; | &iexcl; | Inverted exclamation | ¡
&#162; | &cent; | Cent sign | ¢
&#163; | &pound; | Pound sterling sign | £
&#164; | &curren; | General currency sign | ¤
&#165; | &yen; | Yen sign | ¥
&#166; | &brvbar; | Broken vertical bar | ¦
&#167; | &sect; | Section sign | §
&#168; | &uml; | Umlaut (dieresis) | ¨
&#169; | &copy; | Copyright | ©
&#170; | &ordf; | Feminine ordinal | ª
&#171; | &laquo; | Left angle quote, guillemot left | «
&#172; | &not; | Not sign | ¬
&#173; | &shy; | Soft hyphen |
&#174; | &reg; | Registered trademark | ®
&#175; | &macr; | Macron accent | ¯
&#176; | &deg; | Degree sign | °
&#177; | &plusmn; | Plus or minus | ±
&#178; | &sup2; | Superscript two | ²
&#179; | &sup3; | Superscript three | ³
&#180; | &acute; | Acute accent | ´
&#181; | &micro; | Micro sign | µ
&#182; | &para; | Paragraph sign | ¶
&#183; | &middot; | Middle dot | ·
&#184; | &cedil; | Cedilla | ¸
&#185; | &sup1; | Superscript one | ¹
&#186; | &ordm; | Masculine ordinal | º
&#187; | &raquo; | Right angle quote, guillemot right | »
&#188; | &frac14; | Fraction (one quarter) | ¼
&#189; | &frac12; | Fraction (one half) | ½
&#190; | &frac34; | Fraction (three quarters) | ¾
&#191; | &iquest; | Inverted question mark | ¿
&#192; | &Agrave; | Capital A, grave accent | À
&#193; | &Aacute; | Capital A, acute accent | Á
&#194; | &Acirc; | Capital A, circumflex accent | Â
&#195; | &Atilde; | Capital A, tilde | Ã
&#196; | &Auml; | Capital A, umlaut (dieresis) | Ä
&#197; | &Aring; | Capital A, ring | Å
&#198; | &AElig; | Capital AE diphthong (ligature) | Æ
&#199; | &Ccedil; | Capital C, cedilla | Ç
&#200; | &Egrave; | Capital E, grave accent | È
&#201; | &Eacute; | Capital E, acute accent | É
&#202; | &Ecirc; | Capital E, circumflex accent | Ê
&#203; | &Euml; | Capital E, umlaut (dieresis) | Ë
&#204; | &Igrave; | Capital I, grave accent | Ì
&#205; | &Iacute; | Capital I, acute accent | Í
&#206; | &Icirc; | Capital I, circumflex accent | Î
&#207; | &Iuml; | Capital I, umlaut (dieresis) | Ï
&#208; | &ETH; | Capital Eth, Icelandic | Ð
&#209; | &Ntilde; | Capital N, tilde | Ñ
&#210; | &Ograve; | Capital O, grave accent | Ò
&#211; | &Oacute; | Capital O, acute accent | Ó
&#212; | &Ocirc; | Capital O, circumflex accent | Ô
&#213; | &Otilde; | Capital O, tilde | Õ
&#214; | &Ouml; | Capital O, umlaut (dieresis) | Ö
&#215; | &times; | Multiply sign | ×
&#216; | &Oslash; | Capital O, slash | Ø
&#217; | &Ugrave; | Capital U, grave accent | Ù
&#218; | &Uacute; | Capital U, acute accent | Ú
&#219; | &Ucirc; | Capital U, circumflex accent | Û
&#220; | &Uuml; | Capital U, umlaut (dieresis) | Ü
&#221; | &Yacute; | Capital Y, acute accent | Ý
&#222; | &THORN; | Capital Thorn, Icelandic | Þ
&#223; | &szlig; | Small sharp s, German (sz ligature) | ß
&#224; | &agrave; | Small a, grave accent | à
&#225; | &aacute; | Small a, acute accent | á
&#226; | &acirc; | Small a, circumflex accent | â
&#227; | &atilde; | Small a, tilde | ã
&#228; | &auml; | Small a, umlaut (dieresis) | ä
&#229; | &aring; | Small a, ring | å
&#230; | &aelig; | Small ae diphthong (ligature) | æ
&#231; | &ccedil; | Small c, cedilla | ç
&#232; | &egrave; | Small e, grave accent | è
&#233; | &eacute; | Small e, acute accent | é
&#234; | &ecirc; | Small e, circumflex accent | ê
&#235; | &euml; | Small e, umlaut (dieresis) | ë
&#236; | &igrave; | Small i, grave accent | ì
&#237; | &iacute; | Small i, acute accent | í
&#238; | &icirc; | Small i, circumflex accent | î
&#239; | &iuml; | Small i, umlaut (dieresis) | ï
&#240; | &eth; | Small eth, Icelandic | ð
&#241; | &ntilde; | Small n, tilde | ñ
&#242; | &ograve; | Small o, grave accent | ò
&#243; | &oacute; | Small o, acute accent | ó
&#244; | &ocirc; | Small o, circumflex accent | ô
&#245; | &otilde; | Small o, tilde | õ
&#246; | &ouml; | Small o, umlaut (dieresis) | ö
&#247; | &divide; | Division sign | ÷
&#248; | &oslash; | Small o, slash | ø
&#249; | &ugrave; | Small u, grave accent | ù
&#250; | &uacute; | Small u, acute accent | ú
&#251; | &ucirc; | Small u, circumflex accent | û
&#252; | &uuml; | Small u, umlaut (dieresis) | ü
&#253; | &yacute; | Small y, acute accent | ý
&#254; | &thorn; | Small thorn, Icelandic | þ
&#255; | &yuml; | Small y, umlaut (dieresis) | ÿ