03.md

Recap of the last class

Comments, preprocessor, expressions, break, operators, _Bool, integer types, signedness, modifiers for printf(), L/U suffices, sizeof operator, integer literals, getchar().

Warm-up

Print out number of words read from the standard input. Ideally, use your getchar() based code from your repository to start with.

Always try to reuse code you have already written!

A word is defined as a group of any characters surrounded by whitespace characters. For our case these white space characters will suffice: a tabelator, a space, and a newline.

  • write your own check for a whitespace character, do not use library functions for that

  • check correctness of your implementation with wc -w <file>

  • what happens if the number of words exceeds the type that stores the count?

👀 words.c actually uses a library check for white space for simplicity.

another solution is 👀 words2.c It is even simpler than the first solution while not using the library function.

Example:

$ cat /etc/passwd | ./a.out
70

Array Intro

You define an array like this <element-type> array[<some-number>], for example:

int array[5];

The integer value n in [n] specifies the number of array elements.

  • an array subscript always starts from 0 and ends with n - 1
  • a subscript may be any integer expression
  • the type of array elements is called an element type

So, int array[3] elements will be accessible as a[0] .. a[2], each element is of type int, and therefore you can do e.g. printf("%d\n", a[2]);.

  • 0 as the first subscript so it is easier to work with pointers and efficiency for array access - we will get to pointers later.

  • what is not possible to do with arrays in C (limitations are important knowledge):

    • associative arrays
    • array subscripts returning a sub-array (like found e.g. in Python)
    • assigning to an array as a whole
  • as with integer and floating point variables, we may initialize an array during its definition. In general, the value for the initialization is known as an initializer.

short array[3] = { 1, 2, 3 };

If the array size is omitted the compiler will compute the size from the the number of initializers. So, you can just do the following.

short array[] = { 1, 2, 3 };
char array[] = { 'h', 'e', 'l', 'l', 'o' };

Note that if you need your array to contain only the elements from the initializer, omitting the array size is the way to go to avoid errors as in changing the initializer while forgetting to update the array size.

  • the sizeof operator on array always gets the array size in bytes. It will not get the array size in elements.

    • to get the number of elements in an array, you must divide the array size in bytes by the size of its element. Always use 0 subscript, see below on why.
int a[5];

printf("Elements in array: %zu\n", sizeof (a) / sizeof (a[0]));

The above is the correct approach that is immune to changing the array declaration (i.e. the type of elements). Do not use the following:

    sizeof (array) / sizeof (int)

🔧 Declare an array of chars without setting the array size, initialize the array during declaration (see above), then print each element of the array on a separate line using a while loop.

Arrays introduced so far are not dynamic and can not be resized.

  • try to perform out-of-bound access. What is the threshold for behavior change on your system ?
  • why is it not faulting for the one-off error?

👀 array-out-of-bounds.c

$ ./a.out
Number of array elements: 1024
One-off error (using index 1024)... OK
Assigning to index 4096... Segmentation Fault

You do not need to initialize all elements. With such type of an initialization, you always start from subscript 0, and there are no gaps:

  short array[4] = { 1, 2, 3 };

In the example above, the elements not explicitly initalized are set to 0 so the value of array[3] will be initialized to 0.

  • i.e. int array[100] = { 0 }; will have all values set to 0

  • the initialization is done by a compiler

  • using = {} is not allowed by the C specification (allowed in C++) but generally accepted. Not with -Wpedantic though:

cc -Wpedantic test.c
test.c:1:13: warning: use of GNU empty initializer extension
      [-Wgnu-empty-initializer]
int a[10] = {};
	    ^
1 warning generated.

Note: global variables are always zeroized.

There is a partial array initialization where the initializers are called designated initializers in the C spec:

char array[128] = { [0] = 'A', [2] = 'f', [4] = 'o', [6] = 'o' };

  • a subscript is in square brackets
  • the [subscript] is known as a designator. Inreasing ordering is not required but expected.
  • the rest of elements will be initialized to zero
  • if you do not specify the array size, it is taken from the highest designator index
  • you can combine designators with fixed order initializers, and you always start with the next index. For example:
  /* a[5] is 'e' etc.  a[0]..a[3] are NUL characters (= zero bytes). */
  char a[128] = { [4] = 'h', 'e', 'l', 'l', 'o' };
  • you cannot specify a repetition or setting multiple elements to the same value (there is a gcc extension for that but let's not go there).

👀 array-designated-initializers.c

Note the code above mentions a missing = as a GCC extension. With a non-GCC compiler it does not compile:

$ cc array-designated-initializers.c
"array-designated-initializers.c", line 15: syntax error before or at: 'A'
cc: acomp failed for array-designated-initializers.c

Once declared, the values cannot be assigned at once. So, you can only do things as follows:

int array[4];

array[0] = 1;
array[1] = 2;
array[2] = array[3] = 3;
// ...

You cannot assign an array into array - has to be done an element by element.

  • likewise for comparison

Arrays cannot be declared as empty (int a[0]).

  • this is explicitly forbidden by the standard, see C99 6.7.5.2 Array declarators.
  • GCC accepts that though. Do not use it like that.

👀 empty-array.c

This might be a bit confusing though:

$ gcc empty-array.c
$ ./a.out
4
0

Even if a compiler supports an empty array declaration, sizeof() / sizeof(a[0]) construction is always correct to compute a number of array elements. Plus, the compiler does not do any array access when computing sizeof(a[0]) as the expression is not evaluated (see lecture 02 on the sizeof operator), the compiler only uses the argument to determine the size.

🔧 Home assignment

Note that home assignments are entirely voluntary but writing code is the only way to learn a programming language.

Do digit occurence as in the last class but now use an array for counting the figures.

Function introduction

Functions can be used to encapsulate some work, for code re-factoring, etc.

A function has a single return value and multiple input parameters.

  • if you also need to extract an error code with the value or get multiple return values that needs to be done via passing data via reference (more on that when we have pointers)
  • i.e. this is not like Go that returns the error along with the data

A function declaration is only declaring the API without its body. It is called a function prototype and it consists of a type, a function name, and an argument list in parentheses:

int digit(int c);
int space(int c);
float myfun(int c, int i, float f);

When defining the function, its body is inclosed in {} (just like the main function).

int
return_a_number(void)
{
	return (1000);
}

Also as with the main function, you can use void instead of the argument list to say the function accepts no arguments.

void
hola(void)
{
	printf("Hola!\n");
}

A function call is an expression so you can you it as such:

printf("%d\n", return_a_number());

The default return value type is an int but a compiler will warn if it is missing. You should always specify the type in C99+ . You may use void which means the function does returns no value.

Write your function declarations at the beginning of a C file or include those into a separate header file.

If a compiler hits a function whose prototype is unknown, warnings are issued. Remember 👀 hello-world1.c ? See also here:

$ cat test.c
int
main(void)
{
	fn();
}

float
fn(void)
{
	return (1.0);
}
$
$ gcc test.c
test.c:4:2: warning: implicit declaration of function 'fn' is
invalid in C99
      [-Wimplicit-function-declaration]
	fn();
	^
test.c:8:1: error: conflicting types for 'fn'
fn(void)
^
test.c:4:2: note: previous implicit declaration is here
	fn();
	^
1 warning and 1 error generated.

Argument names may be omitted in prototypes, i.e.:

int myfn(int, int);

A variable defined in function parameters or locally within the function overrides global variables. Each identifier is visible (= can be used) only within a region of program text called its scope. See C99, section 6.2.1 Scopes of identifiers for more information.

Within a function body, you may use its arguments as any other local variables.

int
myfn(int i, int j)
{
	i = i * j;
	return (i);
}

As arguments are always passed by value in C, the variable value is not changed in scope that called the function.

/* myfn defined here */

int
main(void)
{
	int i = 3;

	myfn(i, i);
	printf("%d\n", i);	// will print 3
}

Local variables are on the stack.

Variable passing depends on bitness and architecture. E.g. 32-bit x86 on the stack, 64-bit x64 ABI puts first 6 arguments to registers, the rest on the stack.

Functions can be recursive. 👀 recursive-fn.c

/*
 * Print 0..9 recursively using a function call fn(9).
 */
#include <stdio.h>

void
myfn(int depth)
{
	if (depth != 0)
		myfn(depth - 1);
	printf("%d\n", depth);
}

int
main(void)
{
	myfn(9);
}

🔧 Print 9..0 using a recursive function.

solution 👀 recursive-fn2.c

🔧 Home assignment

Note that home assignments are entirely voluntary but writing code is the only way to learn a programming language.

🔧 Count digits, white space characters, and the rest

Print total number of (decimal) numbers, white space (tab, space, newline) and anything else from the input. I.e. the program prints out three numbers.

Obviously, reuse code you wrote for digit occurence counting: 👀 count-digit-occurence.md

  • write the program using direct comparisons like this:
c >= '0' && c <= '9'
c == ' ' || c == '\t' || c == '\n'
  • put both the checks into separate functions
  • then write a new version of the program and use: - use isspace() from C99
    • isdigit() vs isnumber() - the latter detects digits + possibly more characters (depending on locale setting)

🔧 Count letter occurrence

  • count occurrences of all ASCII letters on standard input and print out a histogram

  • this will be case insensitive (i.e. implement mytolower(int) first)

  • the output will look like this:

$ cat /some/file | ./a.out
a: ***
b: *
c: *****************************
...
z: *******
$
  • use a function to print a specific number of stars
  • use a function to do all the printing
  • declare array(s) to be as efficient as possible
    • that is, having the lowest possible size
    • only use global variables if necessary

Arrays and functions

Arrays in C are not a first class object, rather it is a composition of elements.

An array cannot be returned from a function. A pointer to an array can be but more on that later.

No allowing to return an array is done for efficiency as copying the whole array (by value) would be too expensive.

  • watch for size of an array as a local variable in a function
  • arrays as local variable may significantly increase the stack size (and a stack size is limited in threaded environments)

👀 func-large-array.c

The following may or may not happen in your environment:

$ cc func-large-array.c
$ ./a.out
Segmentation fault: 11

Array of characters

An array of characters (chars) is called a string or a string constant. Do not confuse it with a character constant as in 'A' (single quotes) that is actually an int. A string is put in double quotes "".

So, "hello, world" as string.

Technically, a string constant is an array of characters and may be used to initialize a char array:

char foo[] = "bar";

A special thing is that a string is always terminated by a NUL (zero) byte. That is how C knows where a string ends.

That means the size of the array is one byte more than the number of characters in the string. It is because of the terminating zero (\0) that the compiler adds to terminate the string.

So, you could use an initializer as introduced earlier:

char foo[] = { 'b', 'a', 'r', '\0' };

printf("%zu\n", sizeof (foo));		// prints 4
printf("%zu\n", sizeof ("bar"));	// prints 4

However, that is not generally used as you would normally use "bar" to initialize such an array.

Using '\0' suggests it is a terminating NUL character, and it is just a zero byte. So, you could just use 0, like this, but it is not generally used:

char foo[] = { 'b', 'a', 'r', 0 };

To print a string via printf(), you use the %s conversion specifier:

printf("%s\n", foo);

🔧 What happens if you forget to specify the terminating zero in the above per-char init and try to print the string ?

👀 array-char-nozero.c

Why does it work?

Let's look at the following code. Obviously, the function does not return its value while it definitely should have. What happens if we use the function return value anyway?

#include <stdio.h>

int
addnum(int n1, int n2)
{
        int n = n1 + n2;
}

int
main(void)
{
        printf("%d\n", addnum(1, 99));
}

It simply cannot work, right? Oh, wait..., it does!?!

$ cc main.c
$ ./a.out 
100

So what happened???

Well, I does not really work. It "works" by accident. Let's look at it more closely. To establish a common environment, I will be using u-pl3.ms.mff.cuni.cz which you all have access to, so that you can try and see. Before I begin, let me give you more information:

  • it depends on a system and its version, and a compiler and its version
    • you can also try the clang compiler later on the same machine
  • will be using gcc which defaults to generate 64 bit binaries on the Linux distro installed on u-pl3.ms.mff.cuni.cz (and other machines in the lab).
  • 64 bit binaries on x86 use X86-64 ABI
  • ABI is not API
  • by this ABI (it's law!), the first two function integer arguments are passed through the general purpose edi and esi registers
  • integer function return value is passed back to the caller via the general purpose eax register
  • the stack on x86 grows down

Let's compile the code and disassemble it. All we need are the main and addnum function. See my notes inline.

$ cc main.c
$ objdump -d a.out
...
<removed non-relevant stuff>
...
0000000000001145 <addnum>:
    1145:	55                   	push   %rbp
    1146:	48 89 e5             	mov    %rsp,%rbp

    	- initialize a frame pointer for this function.

    1149:	89 7d ec             	mov    %edi,-0x14(%rbp)
    114c:	89 75 e8             	mov    %esi,-0x18(%rbp)

	- here we put both function arguments on a stack.  As we already
	  learned, function arguments may be used within a function as local
	  variables, so they need to be on a stack.

	- why we put them to -0x14 and -0x18 offsets from the base?  That's just
	  what this gcc version does.  They are 4 bytes apart as we work with 4
	  byte integers.

    114f:	8b 55 ec             	mov    -0x14(%rbp),%edx
    1152:	8b 45 e8             	mov    -0x18(%rbp),%eax

	- moved the values of our local variables to general purpose registers

    1155:	01 d0                	add    %edx,%eax

    	- sum up the values.  As we add edx to eax, the result is in eax.

    1157:	89 45 fc             	mov    %eax,-0x4(%rbp)

	- put the result to the local variable "n" which happens to be at offset
	  -0x4 from the frame pointer.

	- now, see above, register eax is used in x86 64 ABI for the function
	  return value.  So, by accident, we have the "right" value in the
	  register that is supposed to hold the return value!!!

    115a:	90                   	nop
    115b:	5d                   	pop    %rbp
    115c:	c3                   	retq   

000000000000115d <main>:
    115d:	55                   	push   %rbp
    115e:	48 89 e5             	mov    %rsp,%rbp
    1161:	be 63 00 00 00       	mov    $0x63,%esi
    1166:	bf 01 00 00 00       	mov    $0x1,%edi

	- put the values 1 and 99 to the registers representing the 1st and the
	  2nd argument in x86 64 bit ABI

    116b:	e8 d5 ff ff ff       	callq  1145 <addnum>

	- called our function from main()

    1170:	89 c6                	mov    %eax,%esi

	- x86 64 ABI expects the function return value in the register eax, so
	  we just put it to the input register esi representing the 2nd argument
	  for the printf() function.
	- and, given that we happened to have the right value in eax, the code
	  looks like it works.

    1172:	48 8d 3d 8b 0e 00 00 	lea    0xe8b(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1179:	b8 00 00 00 00       	mov    $0x0,%eax
    117e:	e8 bd fe ff ff       	callq  1040 <printf@plt>
    1183:	b8 00 00 00 00       	mov    $0x0,%eax
    1188:	5d                   	pop    %rbp
    1189:	c3                   	retq   
    118a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)

What if we increment variable n, does it change anything?

int
addnum(int n1, int n2)
{
	int n = n1 + n2;
	++n;
}

$ gcc main.c 
./a.out 
100

Apparently not, let's see the disassembled code:

0000000000001145 <addnum>:
    1145:	55                   	push   %rbp
    1146:	48 89 e5             	mov    %rsp,%rbp
    1149:	89 7d ec             	mov    %edi,-0x14(%rbp)
    114c:	89 75 e8             	mov    %esi,-0x18(%rbp)
    114f:	8b 55 ec             	mov    -0x14(%rbp),%edx
    1152:	8b 45 e8             	mov    -0x18(%rbp),%eax
    1155:	01 d0                	add    %edx,%eax
    1157:	89 45 fc             	mov    %eax,-0x4(%rbp)
    115a:	83 45 fc 01          	addl   $0x1,-0x4(%rbp)

    	- OK, the compiler just directly adds 1 to the memory address holding
	  local variable "n".  As it does not change the register eax, it still
	  "works".

    115e:	90                   	nop
    115f:	5d                   	pop    %rbp
    1160:	c3                   	retq

So, let's involve another local variable.

int
addnum(int n1, int n2)
{
	int n = n1 + n2;

	int i = 13;
	n = i + n;
}

$ gcc main.c
$ ./a.out
13

Bingo! It no longer works. What happened?

0000000000001145 <addnum>:
    1145:	55                   	push   %rbp
    1146:	48 89 e5             	mov    %rsp,%rbp
    1149:	89 7d ec             	mov    %edi,-0x14(%rbp)
    114c:	89 75 e8             	mov    %esi,-0x18(%rbp)
    114f:	8b 55 ec             	mov    -0x14(%rbp),%edx
    1152:	8b 45 e8             	mov    -0x18(%rbp),%eax
    1155:	01 d0                	add    %edx,%eax

    	- this was summing up our two function arguments

    1157:	89 45 f8             	mov    %eax,-0x8(%rbp)

	- the result went to the local variable "n"

    115a:	c7 45 fc 0d 00 00 00 	movl   $0xd,-0x4(%rbp)

	- we initialized local variable "i" (0xd in hexa is 13 in decimal)

    1161:	8b 45 fc             	mov    -0x4(%rbp),%eax

	- See?  Now we used eax for further arithmetic operations.  We just put
	  the current value of local variable "i" to the register.

    1164:	01 45 f8             	add    %eax,-0x8(%rbp)

	- and we added the register value to the present value of local variable
	  "n".  And as 13 was left in eax, that served as the function return
	  value.

    1167:	90                   	nop
    1168:	5d                   	pop    %rbp
    1169:	c3                   	retq

🔧 Try the clang compiler and figure out what happened there.

$ clang main.c
main.c:7:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
1 warning generated.
$ ./a.out
0

What we should take away from this situation:

  • anything looking as working does not mean it is correct
  • ideally, if possible, test on different architectures (like x86, SPARC, ARM, etc).
  • using different compilers and different systems may help as well. For example, if you develop in a Linux distro, testing it also on a macOS laptop would be worth it.
  • if something magically stops working that did work before, be ready for breakage like this. Even something that has worked for ages does not necessarily means the code must have been correct.
  • always use -Wall -Wextra GCC options when building your code. More on that later in the seminar. See that clang warns by default which is a good thing.

Do you want more information on x86 ABI and how it really works behind the scene? Search for Solaris Crashdump Analysis Frank Hofmann for more details.

You probably do not want all the gory details though...