The Flat Trantor Society

C standard quibbles

2014-02-09T15:15:00.002-08:00

This article is a collection of personal quibbles regarding the ISO C standard. Expect it to be updated sporadically.

There have been three major editions of the ISO C standard:

C89/C90: The original ISO C standard was published in 1990, and was closely based on the 1989 ANSI C standard. The 1989 ANSI and 1990 ISO standards describe exactly the same language; ISO added some introductory material and renumbered the sections. This web page appears to be a draft of the ANSI version of the standard. The 1995 amendment added digraphs and wide character support.
C99: The second version was published in 1999. N1256 includes the full 1999 standard with the three Technical Corrigenda merged into it.
C11: The third version was published in 2011. The N1570 draft is freely available, and is very nearly identical to the released standard. (There has been one minor Technical Corrigendum.)

Subtopics (these links work on my GitHub page. TODO: Figure out how to fix that or give up and delete the links):

Is int main() necessarily valid? Should it be?
What is an lvalue?
What is an expression?
Infinite loops
fgetc() when sizeof (int) == 1
More stuff ...

Is `int main()` necessarily valid? Should it be?

ISO C 5.1.2.2.1 Program startup

5.1.2.2.1 defines two permitted definitions for main:

int main(void) { /* ... */}
int main(int argc, char *argv[]) { /* ... */ }

followed by:

or equivalent; or in some other implementation-defined manner.

Which means that compilers may accept void main(void), but are not required to do so (more on that later and elsewhere).

This is a very commonly used definition:

int main() { /* ... */ }

As a definition, it says that main has no parameters. As a declaration, though, it doesn't say that main takes no arguments; rather, it says that main takes an unspecified but fixed number and type(s) of parameters -- and if you call it with arguments that are incompatible with the definition, the behavior is undefined.

I argue that int main() is not equivalent to int main(void), and therefore is not a valid definition unless it's covered by the "or in some other implementation-defined manner" clause (i.e., unless the implementation explicitly documents that it supports it).

int main() { /* ... */ } is an old-style non-prototype definition. Support for such definitions is an obsolescent feature (C11 6.11.7).

Furthermore, this program:

int main(void) {
    if (0) {
        main(42);
    }
}

violates a constraint, whereas this program:

int main() {
    if (0) {
        main(42);
    }
}

does not, which implies that the two forms are not equivalent.

I wonder whether those who argue that int main() is valid because it's "equivalent" to int main(void) would make the same argument for:

int main(argc, argv)
int argc;
char *argv[];
{
    /* ... */
}

On the other hand, as long as non-prototype function declarations and definitions are part of the standard, int main() { /* ... */ } probably should be valid. The entire point of continuing to support such definitions and declarations is to avoid breaking pre-ANSI code, written before prototypes were added to the language (it's not as if non-prototype declarations are useful other than for backward compatibility). If int main() is invalid, then no pre-ANSI program is a valid C90, C99, or C11 program, which was surely not the intent.

What is an lvalue?

ISO C 6.3.2.1p1 Lvalues, arrays, and function designators

The definition of the term lvalue (sometimes written l-value) has changed several times over the years. The "L" part of the name was originally an abbreviation of the word "left"; an lvalue can appear on the left hand side of an assignment, and an rvalue can appear on the right hand side.

My (somewhat vague) recollection is that the term "l-value" originally referred to a kind of value, not (as it does now in C) to a kind of expression. Given that n is an integer variable, the expression n could be "evaluated for its l-value" (which identifies the object that it designates, ignoring any value stored in that object), or it could be "evaluated for its r-value" (which retrieves the value stored in the object). The expression n would be evaluated for its l-value if it appeared on the left side of an assignment, or for its r-value in most other contexts. Apparently the terms "l-value" and "r-value" originated in CPL, the ancestor of BCPL, which led to B, which led to C.

Note carefully that, under this definition, an "l-value" is not a pointer value. An "l-value" was the identity of an object, not its address. (Evaluating an expression for its l-value might well involve computing an address internally.)

I've tried and failed to find a reference for these definitions, but a footnote in section 6.3.2.1 of the C standard:

The name "lvalue" comes originally from the assignment expression
`E1 = E2`, in which the left operand `E1` is required
to be a (modifiable) lvalue.  It is perhaps better considered as
representing an object "locator value".  What is sometimes called
"rvalue" is in this International Standard described as the
"value of an expression".

at least strongly suggests that an rvalue is a value, not an expression that yields a value -- though an lvalue is a kind of expression. The term "rvalue" does not appear anywhere else in the C standard.

But that's all pre-C history.

Kernighan & Ritchie, "The C Programming Language", 1st edition, 1978:

An object is a manipulatable region of storage; an lvalue is an expression referring to an object.

This suffers from the same problem as the later ISO C90 definition; see below.
C90 6.2.2.1:

An lvalue is an expression (with an object type or an incomplete type other than void) that designates an object.

Problem: Though this conveys the intent, it implies that a dereferenced null pointer is not an lvalue, which makes lvalue-ness an execution time property. This is clearly not the intent.
C99 6.3.2.1p1:

An lvalue is an expression with an object type or an incomplete type other than void; if an lvalue does not designate an object when it is evaluated, the behavior is undefined.

Problem: This says that any expression of an appropriate type is an lvalue, which is certainly not the intent. For example, it says that 42 (which is an expression of an object type) is an lvalue, and that since it doesn't designate an object, the behavior of any program containing 42 is undefined. What a mess. The intent is that the evaluation of an lvalue that's in a context that requires an lvalue has undefined behavior if it doesn't designate an object.
C11 6.3.2.1p1:

An lvalue is an expression (with an object type other than void) that potentially designates an object; if an lvalue does not designate an object when it is evaluated, the behavior is undefined.

This goes back to the C90 definition and adds the word "potentially" (which was my idea, BTW). This clarifies that a dereferenced null pointer is an lvalue, but if it's evaluated in a context that requires an lvalue it has undefined behavior.

I'm still not entirely happy with this, because it's not clear just what "potentially designates" means.

Ultimately, I think that the term lvalue can be defined syntactically. I think you can go through section 6.5 of the standard and determine that certain kinds of expressions are always lvalues, other kinds of expressions never are, and others are an lvalue or not based on criteria that are easy to specify.

An expression is an lvalue if and only if it is one of the following:

An identifier that is not a function name or enumeration constant;
A string literal;
A parenthesized expression, if and only if the unparenthesized expression is an lvalue;
An indirection expression *x;
A subscript expression (x[y]) (this follows from the definition of the subscript operator and the fact that *x is an lvalue);
A reference to a struct or union member (x.y, x->y); or
A compound literal.

(I don't guarantee this is 100% correct.)

The standard's definition of lvalue should, IMHO, use a list similar to the above. The description of the intent can still use the wording of the current definition, perhaps as a footnote.

Null pointer constants and parenthesized expressions

ISO C 6.3.2.3 Pointers (under 6.3 Conversions)

Paragraph 3:

An integer constant expression with the value 0, or such an
expression cast to type `void *`, is called a *null pointer
constant*.

The problem: 6.5.1 (Primary expressions) says that a parenthesized expression

is an lvalue, a function designator, or a void expression if
the unparenthesized expression is, respectively, an lvalue,
a function designator, or a void expression.

It doesn't say that a parenthesized null pointer constant is a null pointer constant.

Which implies, strictly speaking, that (void*)0 is a null pointer constant, but ((void*)0) is not.

And since 7.1.2 "Standard headers" requires:

Any definition of an object-like macro described in this clause shall
expand to code that is fully protected by parentheses where necessary,
so that it groups in an arbitrary expression as if it were a single
identifier.

this implies that the NULL macro may not be defined as (void*)0, since, for example, that would cause sizeof NULL to be a syntax error.

I'm sure that most C implementations do treat a parenthesized null pointer constant as a null pointer constant, and define NULL either as 0, ((void*)0), or in some other manner.

What is an expression?

ISO C 6.5 Expressions

The syntax and semantics of expressions are described in section 6.5 of the ISO C standard (which covers 30 pages). But the formal definition of the word "expression" is in 6.5p1:

An expression is a sequence of operators and operands that specifies computation of a value, or that designates an object or a function, or that generates side effects, or that performs a combination thereof.

That sounds reasonable -- except that a strict reading of that definition implies that 42 is not an expression. Why not? It contains no operators, and 42 can't be an operand if there is no operator, so it's not "a sequence of operators and operands".

The real definition of expression is syntactic; anything that satisfies the syntactic definition of expression (in 6.5.17, and referring to definitions in the rest of section 6.5) is an expression.

The definition in 6.5p1 either needs to be re-worded so that it includes primary expressions, or it needs to refer to the grammar. A more reader-friendly (but perhaps less precise) English description of what an expression is should still be included.

Integer constant expressions

ISO C 6.6 Constant expressions, paragraph 6

Credit for this goes to Stack Overflow user pablo1977 who posted this question.

6.6p6 says:

An integer constant expression shall have integer type and shall only have operands that are integer constants, enumeration constants, character constants, sizeof expressions whose results are integer constants, _Alignof expressions, and floating constants that are the immediate operands of casts. Cast operators in an integer constant expression shall only convert arithmetic types to integer types, except as part of an operand to the sizeof or _Alignof operator.

The problem: There's no indication that a parenthesized constant is a constant. So (int)3.14 is a constant expression, but (int)(3.14), strictly speaking, is not, because 3.14 is a floating constant but (3.14) is not.

It seems obvious that if (int)3.14 is an integer constant expression, then there's no reason that (int)(3.14) shouldn't be one as well, and though I haven't checked I suspect that all existing compilers treat it as one. If the wording of the standard is to be corrected, some care will have to be taken so that both (int)(3.14) and (int)((3.14)) are integer constant expressions.

Infinite loops

ISO C 6.8.5 Iteration statements, paragraph 6

This was a change made in ISO C 2011.

6.8.5p6 says:

An iteration statement whose controlling expression is not a constant expression, that performs no input/output operations, does not access volatile objects, and performs no synchronization or atomic operations in its body, controlling expression, or (in the case of a for statement) its expression-3, may be assumed by the implementation to terminate.

with a footnote:

This is intended to allow compiler transformations such as removal of empty loops even when termination cannot be proven.

So this clause is all about enabling optimizations, and I'm guessing that it was influenced by the C compiler implementers on the committee.

I presume that they had good reasons for adding this, and that it makes a significant difference in the performance of real-world code. And if you want to write an infinite loop deliberately, you can still do so because of the "constant expression" exception.

But it means that I can write code whose behavior is well defined in terms of pre-2011 C, and that can behave differently in C11. For example:

const int keep_going = 1;
while (keep_going) {
    ;
}
puts("This should never appear");

In C90 and C99, the message "This should never appear" will never be printed. In C11, because keep_going is not a constant expression, the compiler can legally assume that the loop terminates, and the message may or may not be printed.

I'd be interested in seeing cases where this additional permission is actually helpful.

Furthermore, I find the way this permission is worded to be clumsy. It's a statement about what the implementation is permitted to assume. What really matters is what the implementation is permitted to do. A better and more consistent way of expressing this, I think, would have been something like:

If an iteration statement whose controlling expression is not a constant expression, that performs no input/output operations, does not access volatile objects, and performs no synchronization or atomic operations in its body, controlling expression, or (in the case of a for statement) its expression-3 does not explicitly terminate, it is unspecified whether it terminates or not.

Or it could say that if such a loop does not terminate, the behavior is undefined -- but that would give compilers much more latitude than the current wording.

`fgetc()` when `sizeof (int) == 1`

ISO C 7.21.7.1 The `fgetc` function

The standard makes some implicit assumptions about how character input works. If sizeof (int) == 1 (which requires CHAR_BIT >= 16), EOF isn't distinct from any valid char value. I think there are also some assumptions about how unsigned-to-signed conversion works; the result is implementation-defined, but some possible implementation definitions would break stdio character input. I need to study this further.

More stuff ...

... as I think of it.

Last updated Wed Mar 19 14:42:43 2014 -0700

Where should the control key be?

2013-12-28T15:03:00.002-08:00

Almost all modern computer keyboards place the Caps Lock key immediately to the left of A, with the Shift key below it (next to Z) and the Control key below that, in the lower left corner.

It wasn't always this way.

For example, many of Sun's keyboards (images here) put the Control key immediately to the left of A, and the Caps Lock key in the lower left corner.

If you happen to like the "modern" layout, that's great; I'm not going to try to change your mind, and you can feel free to stop reading now.

But personally, I find it much easier to type when the Control key is immediately to the left of the A key, and the Caps Lock (which I hardly ever use) is either safely out of easy reach or disabled altogether. I use control sequences extensively. I'm a heavy user of vim, I occasionally use Emacs, and I use Emacs-style key bindings in the bash shell. Reaching my left pinky finger down below the shift key every few seconds is quite awkward, but if the control key is on the home row I don't even have to think about it. Yes, I've tried using keyboards with Control below Shift; no, I've never been able to get used to it.

Fortunately, there are ways to remap your keyboard in software so that the key labeled "Caps Lock" acts as a Control key. Unfortunately, those ways vary considerably from one operating system to another.

Microsoft Windows:

Microsoft Windows does let you do some limited keyboard remapping through the Control Panel (in Windows 7 at least, it's under "Region and Language", not under "Keyboard") -- but for some unfathomable reason there's no option to remap the Caps Lock and Control keys.

You can swap the Control and Caps Lock keys, or make Caps Lock an additional Control key, by modifying the system registry. I provide instructions for doing so here. Unfortunately, this is a system-wide setting; it doesn't let you change the layout for an individual user. I advise not applying this registry patch to a shared Windows system unless you're sure that all users of the system are ok with a "non-standard" keyboard layout.
Linux (or GNU/Linux if you prefer):

Fortunately, Linux-based systems generally do let you modify keyboard layouts on a per-user basis. The specific method can vary depending on which distribution and desktop environment you use. One of the following methods is likely to work.

See also this question and this answer on unix.stackexchange.com.

UNIX-like command-line solutions:

Either of the following commands should work to map Caps Lock to Control (making both keys act like a Control key) for the duration of the current X session:

xmodmap -e 'clear Lock' \
        -e 'keycode 0x42 = Control_L' \
        -e 'add Control = Control_L'

or:

setxkbmap -option ctrl:nocaps

I think the setxkbmap command is newer; you might have to resort to xmodmap for some older systems.

I think that

setxkbmap -option ctrl:swapcaps

will swap the Control and Caps Lock keys, but I haven't tried it.

Both of these have the drawback that the behavior will revert to the default when the current X session terminates (typically when you log out or reboot). You can either re-execute the command on startup, or arrange for the system to do it for you.

I find it more convenient, where possible, to do this through the desktop GUI, so the setting is persistent across reboots.

Debian 6, Gnome desktop:
- "System" > "Preferences" > "Keyboard"
- Select the "Layouts" tab
- Highlight the layout you use (mine is "USA")
- Click the "Options" button
- Under "Ctrl key position", select "Make CapsLock an additional Ctrl", or whichever option you prefer.
Linux Mint 14, Cinnamon desktop:
- From the "System Tools" menu, select "System Settings", then open "Keyboard Layout"
- Select the "Layouts" tab
- Click the "Options..." button.
- Open "Caps Lock key behavior" and select the option you prefer. I use "Make Caps Lock an additional Control but keep the Caps_Lock keysym", which makes both Caps Lock and Control act as a Control key.
Linux Mint 15, Cinnamon desktop:
- From the "System Tools" menu, select "System Settings", then open "Regional Settings"
- Select the "Layouts" tab
- Click the "Options..." button.
- Open "Caps Lock key behavior" and select the option you prefer. I use "Make Caps Lock an additional Control but keep the Caps_Lock keysym", which makes both Caps Lock and Control act as a Control key.
Linux Mint 16, KDE desktop:
- From the main menu, select "Applications", then "Settings", then "System Settings".
- Under "Hardware", open "Input Devices"
- Keyboard settings are shown by default; open the "Advanced" tab.
- Click the "Control keyboard options" checkbox.
- Open "Ctrl Key Position"
- Enable and select "Caps lock as Ctrl" or "Swap Ctrl and Caps Lock"
Linux Mint 17, Xfce desktop: Oddly, the Xfce settings GUI doesn't seem to have an option to change the behavior of the Caps Lock key. See "UNIX-like command-line solutions" above.

Modifying /etc/default/keyboard will affect all users on the system.

Linux virtual console: This web page discusses various ways to remap the control key in the Linux virtual console. (This is the text-only console reachable by typing Ctrl-Alt-F1, Ctrl-Alt-F2, etc.). The most straightforward method seems to be:
- Add the line XKBOPTIONS="ctrl:nocaps" to /etc/default/keyboard
- $ sudo dpkg-reconfigure -phigh console-setup
Replace nocaps by swapcaps if you prefer to swap Control and Caps-Lock rather than making both keys act like Control keys.

I've tried this on Debian 6, and it works after a reboot.
Mac OS X 10.5.8:
- System Preferences
- Keyboard & Mouse
- Keyboard tab > Modifier Keys ...
- Change Caps Lock to act as Control
- Optional: Change Control to act as Caps Lock

Last updated 2014-07-30 17:46:23 -0700

Markdown

2012-11-05T16:29:00.002-08:00

I've decided to start composing and maintaining this blog using Markdown.

If you're not familiar with it (or even if you are), Markdown is a text-to-HTML conversion tool for web writers. Raw Markdown is much more readable and easier to work with than raw HTML. It doesn't directly provide the full power of HTML, though you can include raw HTML in a Markdown document -- and you can do italics, bold, and bold italics directly in Markdown.

It's used (in slightly different flavors) on GitHub and on the StackExchange network of sites, among other places.

All posts on this blog are maintained as a GitHub project. If you're sufficiently curious, you can see the Markdown form of all the articles, and how I've revised them over time.

One thing I've noticed with the composition software used by blogspot.com is that switching between the "HTML" and "Compose" views changes the HTML; in particular, it removes <p> paragraph markup, replacing it by <br /> line breaks. Because of this I need to copy the Markdown-generated HTML into the HTML window and click the Update button without looking at the preview. Annoying, but not fatal.

Markdown is converted to HTML by the markdown command, which is also available as a .deb package on Debian and Debian-derived systems such as Ubuntu and Linux Mint:

sudo apt-get install markdown

It should be available for other systems as well. I run a simple gen-html script (included in the GitHub project for this blog), and then manually copy-and-paste the generated HTML into blogspot.com's web interface. The manual step is annoying, but overall it should make it easier to write and maintain this blog.

Pandoc is another good conversion tool; it handles numerous other formats as well. It should be available for most systems.

Who knows, I might even get around to posting more articles!

Last updated Sat 2013-12-28 16:29:29 PST

No, strncpy() is not a "safer" strcpy()

2012-03-05T16:35:00.000-08:00

The C standard library declares a number of string functions in the standard header <string.h>.

By the standards of some other languages, C's string handling is fairly primitive. Strings are simply arrays of characters terminated by a null character '\0', and are manipulated via char* pointers. C has no string type. Instead, a "string" is a data layout, not a data type. Quoting the ISO C standard:

A string is a contiguous sequence of characters terminated by and including the first null character.

So what happens if you call a C string function with a pointer into a char array that isn't properly terminated by a null character? Such an array does not contain a "string" in the sense that C defines the term, and the behavior of most of C's string functions on such arrays is undefined. That doesn't mean the function will fail cleanly, or even that your program will crash; it means that as far as the standard is concerned, literally anything can happen. In practice, what typically happens is that the function will keep looking for that terminating null character either until it finds it in some chunk of memory it really shouldn't be looking at, or until it crashes because it looked in some chunk of memory that it really shouldn't be looking at.

To partially address this, C provides "safer" versions of some string functions, versions that let you specify the maximum size of an array. For example, the strcmp() function compares two strings, but can fail badly if either of the arguments points to something that isn't a string. The strncmp() function is a bit safer; it requires a third argument that specifies the maximum number of characters to examine in each array:

int strcmp (const char *s1, const char *s2);
int strncmp(const char *s1, const char *s2, size_t n);

Which brings us (finally!) to the topic of this article: the strncpy() function.

strcpy() is a fairly straightforward string function. Given two pointers, it copies the string pointed to by the second pointer into the array pointed to by the first. (The order of the arguments mimics the order of the operands in an assignment statement.) It's up to the caller to ensure that there's enough room in the target array to hold the copied contents.

So you'd think that strncpy() would be a "safer" version of strcpy(). And given their respective declarations, that's exactly what it looks like:

char *strcpy (char *dest, const char *src);
char *strncpy(char *dest, const char *src, size_t n);

But no, that's not what the strncpy() function does at all.

Here's the description of strcpy() from the latest draft of the C standard:

The strcpy function copies the string pointed to by s2 (including the terminating null character) into the array pointed to by s1. If copying takes place between objects that overlap, the behavior is undefined.

And here's the corresponding description of strncpy():

The strncpy function copies not more than n characters (characters that follow a null character are not copied) from the array pointed to by s2 to the array pointed to by s1. If copying takes place between objects that overlap, the behavior is undefined.

So far, so good, right? Almost -- but there's more:

If the array pointed to by s2 is a string that is shorter than n characters, null characters are appended to the copy in the array pointed to by s1, until n characters in all have been written.

That second paragraph means that if the string pointed to by s2 is shorter than n characters, strncpy doesn't just copy the string and add a terminating null character, which is what you'd expect. It adds null characters until it has written a total of n characters. If the source string is 5 characters long, and the target is a 1024-byte buffer, and you set n to the size of the target, strncpy will copy those 5 characters and then fill all 1019 remaining bytes in the target with null characters. Since all it takes to terminate a string is a single null character, this is almost always a waste of time.

Ok, so that's not so bad. CPUs are fast these days, and filling a buffer with zeros is not an expensive operation, right? Unless you're doing it a few billion times, but let's not worry about premature optimization.

The trap is in that first paragraph. If the target buffer is 5 characters long, you'd quite reasonably set n to 5. But if the source string is 5 or more characters long, then you'll end up without a terminating null character in the target array. In other words, the target array won't contain a string. Try to treat it as if it does (say, by calling strlen() on it or passing it to printf()), and Bad Things Can Happen.

The description of the strcpy() and strncpy() functions is identical in the 1990, 1999, and 2011 versions of the ISO C standard -- except that C99 and C11 add a footnote to the strncpy() description:

Thus, if there is no null character in the first n characters of the array pointed to by s2, the result will not be null-terminated.

The bottom line is this: in spite of its frankly misleading name, strncpy() isn't really a string function.

[TODO: Discuss dest[0]='\0'; strncat(dest, src, size); as a better-behaved alternative, something that does what most people assume strncpy() does.]

Now having a function like this in the standard library isn't such a bad thing in itself. It's designed to deal with a specialized data structure, a fixed-size character array of N characters that can contain up to N characters of actual data, with the rest of the array (if any) padded with 0 or more null characters. Early Unix systems used such a structure to hold file names in directories, for example (though it's not clear that strncpy() was invented for that specific purpose).

The problem is that the name strncpy() strongly implies that it's a "safer" version of strcpy(). It isn't.

Most of the other strn*() functions are safer versions of their unbounded counterparts: strcat() vs. strncat() and strcmp() vs. strncmp(). [TODO: Discuss the bounds-checking versions added in Annex K of the 2011 ISO C standard.]

It's because strncpy()'s name implies something that it isn't that it's such a trap for the unwary. It's not a useless function, but I see far more incorrect uses of it than correct uses. This article is my modest attempt to spread the word that strncpy() isn't what you probably think it is.

I've put together a small demo as a GitHub project.

Last updated Mon Feb 17 08:33:27 2014 -0800

First post

2012-01-13T21:01:00.000-08:00

Greetings to my vast army of followers.

This is my new blog, in which I will sporadically post rants and musings on software development, programming language standards, and whatever else strikes my fancy at the moment.

Welcome.

Last updated Mon 2012-11-05 16:48:00 PST