03.md

Warm-up

Count words

Print out number of words read from the standard input. Ideally, use your getchar based code from your repository to start with.

Always try to reuse code you have already written! Then do not forget to upload it to your source code management repository.

A word is defined as a group of any characters surrounded by whitespace characters. Those are: a tabelator, a space, and a newline.

  • Write your own check for a whitespace character, do not use library functions for that.

  • Check correctness of your implementation with wc -w <file>

  • What happens if the number of words exceeds the type that stores the count?

🔑 words.c Does uses a library check isspace for white space for simplicity.

🔑 words2.c It is even simpler than the first solution while not using the library function.

Example:

$ cat /etc/passwd | ./a.out
70

Array Intro

You define an array like this <element-type> array[<some-number>], for example:

int array[5];

The integer value n in [n] specifies the number of array elements.

  • An array subscript (also called an index) always starts from 0 and ends with n - 1
  • A subscript may be any integer expression
  • The type of array elements is called an element type

So, in array int array[3], elements will be accessible as a[0] .. a[2]. Each element is of type int, and therefore you can do e.g. printf("%d\n", a[2]);.

  • 0 was chosen as the first subscript so that it was easier to work with pointers, and also for array access efficiency - we will get to pointers later.

  • What is not possible to do with arrays in C (limitations are important knowledge):

    • Associative arrays
    • Array subscripts returning a sub-array (like found e.g. in Python)
    • Assigning to an array as a whole
  • As with integer and floating point variables, we may initialize an array during its definition. In general, the contruct for any variable initialization (not just arrays) is known as an initializer.

short array[3] = { 1, 2, 3 };	// "{ 1, 2, 3 }" is the initializer

If the array size is omitted the compiler will compute the size from the number of initializers. So, you can just do the following:

short array[] = { 1, 2, 3 };
char array[] = { 'h', 'e', 'l', 'l', 'o' };

If you need your array to contain only the elements from the initializer, omitting the array size is the way to go to avoid errors in modifying the initializer while forgetting to update the array size.

  • The sizeof operator on array always gets the array size in bytes. It will not get the array size in elements.

    • To get the number of elements in an array, you must divide the array size in bytes by the size of its element. Always use 0 subscript, see below on why.
int a[5];

printf("Elements in array: %zu\n", sizeof (a) / sizeof (a[0]));

The above is the correct approach that is immune to changing the array declaration (i.e. the type of elements). As arrays may not be empty, there is always an element [0] (see below). Do not use the following:

sizeof (array) / sizeof (int)

🔧 Declare an array of chars without setting the array size, initialize the array during declaration (see above), then print each element of the array on a separate line using a while loop.

Arrays introduced so far are not dynamic and can not be resized.

  • Try to perform an out-of-bound access. What is the threshold for behavior change on your system ?
  • Why is it not faulting for the one-off error?

👀 array-out-of-bounds.c

$ ./a.out
Number of array elements: 1024
One-off error (using index 1024)... OK
Assigning to index 4096... Segmentation Fault

You can also try to locate where it crashed. For more information, see some info on debugging.

You do not need to initialize all elements. With such type of an initialization, you always start from subscript 0, and there are no gaps:

short array[4] = { 1, 2, 3 };

In the example above, the elements not explicitly initalized are set to 0 so the value of array[3] will be initialized to 0.

  • I.e. int array[100] = { 0 }; will have all values set to 0

  • The initialization is done in code added by the compiler. That code is part of the runtime system of the language.

  • Using = {} is not allowed by the C specification (allowed in C++) but generally accepted. Not with -Wpedantic though:

cc -Wpedantic test.c
test.c:1:13: warning: use of GNU empty initializer extension
      [-Wgnu-empty-initializer]
int a[10] = {};
	    ^
1 warning generated.

Note: global variables are always zeroized.

There is a partial array initialization where the initializers are called designated initializers in the C spec:

char array[128] = { [0] = 'A', [2] = 'f', [4] = 'o', [6] = 'o' };
  • A subscript is in square brackets
  • The [subscript] is known as a designator. Increasing ordering is not required but expected.
  • The rest of elements will be initialized to zero
  • If you do not specify the array size, it is taken from the highest designator index
  • You can combine designators with fixed order initializers, and you always start with the next index. For example:
  /* a[5] is 'e' etc.  a[0]..a[3] are NUL characters (= zero bytes). */
  char a[128] = { [4] = 'h', 'e', 'l', 'l', 'o' };
  • You cannot specify a repetition or setting multiple elements to the same value (there is a gcc extension for that but let's not go there).

👀 array-designated-initializers.c

Note that the code file right above mentions in a comment that a missing = is a GCC extension. With a non-GCC compiler it does not compile:

$ cc array-designated-initializers.c
"array-designated-initializers.c", line 15: syntax error before or at: 'A'
cc: acomp failed for array-designated-initializers.c

Once an array is declared, its elements cannot be assigned at once. So, you can only do things as follows:

int array[4];

array[0] = 1;
array[1] = 2;
array[2] = array[3] = 3;
// ...

You cannot assign an array into another array - has to be done an element by element.

  • Likewise for a comparison

Arrays cannot be declared as empty (int a[0]).

  • This is explicitly forbidden by the standard, see C99 6.7.5.2 Array declarators.
  • GCC accepts that though. Do not use it like that.

👀 empty-array.c

This might be a bit confusing though:

$ gcc empty-array.c
$ ./a.out
4
0

Even if a compiler supports an empty array declaration, sizeof(a) / sizeof(a[0]) construction is always correct to compute a number of array elements. Plus, remember the compiler does not do any array access when computing sizeof(a[0]) as the expression is not evaluated (see the sizeof operator ), the compiler only uses the argument to determine the size.

Function introduction

Functions can be used to encapsulate some work, for code re-factoring, etc.

A function has a single return value and multiple input parameters.

  • If you also need to extract an error code with the value or get multiple return values, that needs to be done via passing data via reference (more on that when we have pointers).
  • I.e. this is not like Go that returns an error along with the data.

A function declaration is only declaring an API without its body. In C, it is called a function prototype and it consists of a type, a function name, and an parameter list in parentheses. For example:

int digit(int c);
int space(int c);
float myfun(int c, int i, float f);

When defining the function, its body is inclosed in {} (just like the main function).

int
return_a_number(void)
{
	return (1000);
}

When we declare or define a function, the function has parameters. When we call such a function though, we pass in arguments. So, for the above mentioned function digit, if called as digit(7), we passed argument 7 as the parameter c. Note that in the past these were also called, respectively, formal parameters and actual parameters.

Also as with the main function, you can use void instead of the argument list to say the function accepts no arguments. You could just use () to indicate the function has no arguments but that is an old way of doing things and in that case the compiler will not verify the number of arguments when calling the function (you can check it out for yourself). So, always use (void) if the function has no arguments.

void
hola(void)
{
	printf("Hola!\n");
}

A function call is an expression so you can use it as such:

printf("%d\n", return_a_number());

The default return value type is an int but a compiler will warn if it is missing. You should always specify the type in C99+ . You may use void which means the function returns no value.

Write your function declarations at the beginning of a C file or include those into a separate header file.

If a compiler hits a function whose prototype is unknown, warnings are issued. Remember 👀 hello-world1.c ? See also here:

$ cat test.c
int
main(void)
{
	fn();
}

float
fn(void)
{
	return (1.0);
}
$
$ gcc test.c
test.c:4:2: warning: implicit declaration of function 'fn' is
invalid in C99
      [-Wimplicit-function-declaration]
	fn();
	^
test.c:8:1: error: conflicting types for 'fn'
fn(void)
^
test.c:4:2: note: previous implicit declaration is here
	fn();
	^
1 warning and 1 error generated.

Parameter names may be omitted in prototypes, i.e.:

int myfn(int, int);

A variable defined in function parameters or locally within the function overrides global variables. Each identifier is visible (= can be used) only within a region of program text called its scope. See C99, section 6.2.1 Scopes of identifiers for more information.

Within a function body, you may use its parameters as any other local variables.

int
myfn(int i, int j)
{
	i = i * j;
	return (i);
}

As arguments are always passed by value in C, the variable value is not changed in the caller:

/* myfn defined here */

int
main(void)
{
	int i = 3;

	printf("%d\n", myfn(i, i));	// will print 9
	printf("%d\n", i);		// will print 3
}

Local variables are stored on a stack or in registers.

Argument passing depends on bitness and architecture. E.g. 32-bit x86 puts them on the stack, 64-bit x64 ABI puts first 6 arguments to registers, the rest on a the stack.

Functions can be recursive. 👀 recursive-fn.c

/*
 * Print 0..9 recursively using a function call fn(9).
 */
#include <stdio.h>

void
myfn(int depth)
{
	if (depth != 0)
		myfn(depth - 1);
	printf("%d\n", depth);
}

int
main(void)
{
	myfn(9);
}

🔧 Print 9..0 using a recursive function.

solution 👀 recursive-fn2.c

Arrays and functions

Arrays in C are not a first class object, rather it is an aggregation of elements. An array is one of aggregate types, see C99 spec 6.2.5 Types (paragraph 21) for more information.

An array cannot be returned from a function. A pointer to an array can be but more on that later.

Not allowing to return an array is done for efficiency as copying the whole array (by value) would be too expensive.

  • Watch for array sizes if used within functions.
  • Arrays as local variable may significantly increase the stack size (and a stack size is limited in threaded environments).

👀 func-large-array.c

The following may or may not happen in your environment:

$ cc func-large-array.c
$ ./a.out
Segmentation fault: 11

Strings

A contiguous sequence of non-zero bytes (chars) terminated by a zero byte is called a string (C99 7.1.1).

char foo[] = { 'b', 'a', 'r', 0 };

Note that a string does not have to be declared as an array as shown above, it may be just a piece of memory that meets the definition of a string above.

The crucial fact is that a string is always terminated by a null (zero) byte, sometimes also called NUL. That is how C knows where the string ends. There is no metadata involved with a string, e.g. its length. To figure out the length of a string, the string must be sequentially searched for the first NUL.

The string length is the number of bytes preceding the null character.

It also means the size of such an character array is one byte more than the number of non-zero characters in the string. It is because of the terminating zero (\0). The NUL belongs to the string though.

char foo[] = { 'b', 'a', 'r', '\0' };

printf("%zu\n", sizeof (foo));		// prints 4

Using '\0' suggests it is a terminating null character but it is still just a zero byte. So, you could use 0, as follows, but it is not generally used:

char foo[] = { 'b', 'a', 'r', 0 };
  • To print a string via printf(), you use the %s conversion specifier:
char foo[] = { 'b', 'a', 'r', 0 };

printf("%s\n", foo);	/* will print "bar" without the quotes */

🔧 What happens if you forget to specify the terminating zero in the above per-char initializator and try to print the string ?

👀 array-char-nozero.c

Note that the following array of characters contains three strings:

char foo[] = { 'a', '\0', 'b', 0, 'c', 0 };

And a character array does not need to be or contain a string at all:

char not_a_string[] = { 'h', 'i' };

A special case of of initializing a character array that is to represent a string is using double quotes. The terminating zero will be the last character of the array.

char mys[] = "This is a string";

int i = 0;

/* Now print the string from the array one by one. */
while (mys[i] != '\0')
	printf("%c", mys[i++]);
printf("\n");

/* Normally, you would of course do the following. */
printf("%s\n", mys);

👀 init-array-from-string.c

Note that the above really creates a local variable on the stack with its memory initialized from the string when the code is executed.

$ cc init-array-from-string.c
$  strings a.out | grep bar
foobar

Function return values and ABI

Let's look at the following code. Obviously, the function does not return its value while it definitely should have. What happens if we use the function return value anyway?

👀 why-it-works.c

#include <stdio.h>

int
addnum(int n1, int n2)
{
        int n = n1 + n2;
}

int
main(void)
{
        printf("%d\n", addnum(1, 99));
}

It simply cannot work, right? Oh, wait..., it does!?!

$ cc why-it-works.c
$ ./a.out
100

So what happened???

Well, it does not really work. It "works" by accident. Let's look at it more closely. To establish a common environment, I will be using the Linux lab at Malá Strana which you all have access to, so that you can try and see.

Before I begin, let me give you more information:

  • It depends on a system and its version, and a compiler and its version.
    • You can also try the clang compiler later on the same machine.
  • Will be using gcc which defaults to generate 64 bit binaries on the Linux distro installed on u-pl3.ms.mff.cuni.cz (and other machines in the lab).
  • 64 bit binaries on x86 use X86-64 ABI
  • ABI is not API
  • By this ABI (it's law!), the first two function integer arguments are passed through the general purpose edi and esi registers
  • Integer function return value is passed back to the caller via the general purpose eax register
  • The stack on x86 grows down

Let's compile the code and disassemble it. All we need are the main and addnum function. See my notes inline.

$ cc why-it-works.c
$ objdump -d a.out
...
<removed non-relevant stuff>
...
0000000000001145 <addnum>:
    1145:	55                   	push   %rbp
    1146:	48 89 e5             	mov    %rsp,%rbp

    	- initialize a frame pointer for this function.

    1149:	89 7d ec             	mov    %edi,-0x14(%rbp)
    114c:	89 75 e8             	mov    %esi,-0x18(%rbp)

	- here we put both function arguments on a stack.  As we already
	  learned, function arguments may be used within a function as local
	  variables, so they need to be on a stack.

	- why we put them to -0x14 and -0x18 offsets from the base?  That's just
	  what this gcc version does.  They are 4 bytes apart as we work with 4
	  byte integers.

    114f:	8b 55 ec             	mov    -0x14(%rbp),%edx
    1152:	8b 45 e8             	mov    -0x18(%rbp),%eax

	- moved the values of our local variables to general purpose registers

    1155:	01 d0                	add    %edx,%eax

    	- sum up the values.  As we add edx to eax, the result is in eax.

    1157:	89 45 fc             	mov    %eax,-0x4(%rbp)

	- put the result to the local variable "n" which happens to be at offset
	  -0x4 from the frame pointer.

	- now, see above, register eax is used in x86 64 ABI for the function
	  return value.  So, by accident, we have the "right" value in the
	  register that is supposed to hold the return value!!!

    115a:	90                   	nop
    115b:	5d                   	pop    %rbp
    115c:	c3                   	retq   

000000000000115d <main>:
    115d:	55                   	push   %rbp
    115e:	48 89 e5             	mov    %rsp,%rbp
    1161:	be 63 00 00 00       	mov    $0x63,%esi
    1166:	bf 01 00 00 00       	mov    $0x1,%edi

	- put the values 1 and 99 to the registers representing the 1st and the
	  2nd argument in x86 64 bit ABI

    116b:	e8 d5 ff ff ff       	callq  1145 <addnum>

	- called our function from main()

    1170:	89 c6                	mov    %eax,%esi

	- x86 64 ABI expects the function return value in the register eax, so
	  we just put it to the input register esi representing the 2nd argument
	  for the printf() function.
	- and, given that we happened to have the right value in eax, the code
	  looks like it works.

    1172:	48 8d 3d 8b 0e 00 00 	lea    0xe8b(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1179:	b8 00 00 00 00       	mov    $0x0,%eax
    117e:	e8 bd fe ff ff       	callq  1040 <printf@plt>
    1183:	b8 00 00 00 00       	mov    $0x0,%eax
    1188:	5d                   	pop    %rbp
    1189:	c3                   	retq   
    118a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)

What if we increment variable n, does it change anything?

int
addnum(int n1, int n2)
{
	int n = n1 + n2;
	++n;
}
$ gcc main.c 
./a.out 
100

Apparently not, let's see the disassembled code:

0000000000001145 <addnum>:
    1145:	55                   	push   %rbp
    1146:	48 89 e5             	mov    %rsp,%rbp
    1149:	89 7d ec             	mov    %edi,-0x14(%rbp)
    114c:	89 75 e8             	mov    %esi,-0x18(%rbp)
    114f:	8b 55 ec             	mov    -0x14(%rbp),%edx
    1152:	8b 45 e8             	mov    -0x18(%rbp),%eax
    1155:	01 d0                	add    %edx,%eax
    1157:	89 45 fc             	mov    %eax,-0x4(%rbp)
    115a:	83 45 fc 01          	addl   $0x1,-0x4(%rbp)

    	- OK, the compiler just directly adds 1 to the memory address holding
	  local variable "n".  As it does not change the register eax, it still
	  "works".

    115e:	90                   	nop
    115f:	5d                   	pop    %rbp
    1160:	c3                   	retq

So, let's involve another local variable.

int
addnum(int n1, int n2)
{
	int n = n1 + n2;

	int i = 13;
	n = i + n;
}
$ gcc main.c
$ ./a.out
13

Bingo! It no longer "works". What happened?

0000000000001145 <addnum>:
    1145:	55                   	push   %rbp
    1146:	48 89 e5             	mov    %rsp,%rbp
    1149:	89 7d ec             	mov    %edi,-0x14(%rbp)
    114c:	89 75 e8             	mov    %esi,-0x18(%rbp)
    114f:	8b 55 ec             	mov    -0x14(%rbp),%edx
    1152:	8b 45 e8             	mov    -0x18(%rbp),%eax
    1155:	01 d0                	add    %edx,%eax

    	- this was summing up our two function arguments

    1157:	89 45 f8             	mov    %eax,-0x8(%rbp)

	- the result went to the local variable "n"

    115a:	c7 45 fc 0d 00 00 00 	movl   $0xd,-0x4(%rbp)

	- we initialized local variable "i" (0xd in hexa is 13 in decimal)

    1161:	8b 45 fc             	mov    -0x4(%rbp),%eax

	- See?  Now we used eax for further arithmetic operations.  We just put
	  the current value of local variable "i" to the register.

    1164:	01 45 f8             	add    %eax,-0x8(%rbp)

	- and we added the register value to the present value of local variable
	  "n".  And as 13 was left in eax, that served as the function return
	  value.

    1167:	90                   	nop
    1168:	5d                   	pop    %rbp
    1169:	c3                   	retq

🔧 Try the clang compiler and figure out what happened there.

$ clang main.c
main.c:7:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
1 warning generated.
$ ./a.out
0

What we should take away from this situation:

  • Anything looking as working does not mean it is correct.
  • Ideally, if possible, test on different architectures (like x86, SPARC, ARM, etc).
  • Using different compilers and different systems may help as well. For example, if you develop in a Linux distro with gcc, testing it also on a macOS laptop would be worth it (C compiler in macOS Xcode IDE is clang).
  • If something magically stops working that did work before, be ready for breakage like this. Even something that has worked for ages does not necessarily means the code must have been correct.
  • Always use -Wall -Wextra GCC options when building your code. More on that later in the seminar. See that clang warns by default which is a good thing.

Do you want more information on x86 ABI and how it really works behind the scene? Search for Solaris Crashdump Analysis Frank Hofmann for more details.

You probably do not want all the gory details though...

🔧 Home assignment

Note that home assignments are entirely voluntary but writing code is the only way to learn a programming language.

🔧 Count digit occurrence (use arrays)

Write a simple program on counting the digit occurence as in one of the previous classes but now use an array for counting the figures.

🔧 Count digits, white space characters, and the rest

Print total number of (decimal) numbers, white space (tab, space, newline) and anything else from the input. I.e. the program prints out three numbers.

Obviously, reuse code you wrote for digit occurence counting: count-digit-occurence.md

  • write the program using direct comparisons like this:
c >= '0' && c <= '9'
c == ' ' || c == '\t' || c == '\n'
  • put both the checks into separate functions
  • then write a new version of the program and use:
    • isspace() from C99
    • isdigit() vs isnumber() - the latter detects digits + possibly more characters (depending on locale setting)

🔧 Count letter occurrence

  • Count occurrences of all ASCII letters on standard input and print out a histogram
  • This will be case insensitive (i.e. implement mytolower(int) first)
  • The output will look like this:
$ cat /some/file | ./a.out
a: ***
b: *
c: *****************************
...
z: *******
$
  • Use a function to print a specific number of stars.
  • Use a function to do all the printing.
  • Declare array(s) to be as efficient as possible.
    • That is, having the lowest possible size.
    • Only use global variables if necessary.