What I Learned in Boating School: Fall 2018 part 2

I find what I learn in school very exciting and I want to share it with the world. I will be maintaining this series to recap on my favorite topics at the end of each semester.

Two of my favorite and most challenging classes this fall were Computer Architecture and Programming Language concepts and I'm excited to talk about some of the techniques I learned this semester.

Computer Architecture

In this class we started with how characters, integers, and floating point numbers are represented in binary. We refreshed ourselves on binary arithmetic and learned how the C language is interpreted into bytecode.

From there we moved onto how to decompile C programs into assembly using objdump and how to intepret the assembly language.

We used these skills in order to find when buffer and stack overflows occur, how to identify how gcc prevents overflows using a stack protector (canary) register. We also looked into program optimization when working with loops by interpretting how data structures are stored on the stack and how this affects iteration.

Binary Arithmetic

Adding binary numbers works in the same way as decimal numbers. For example 10101+10011 or 21+19

Subtraction works the same way as decimal subtraction, however, it requires a change of thought.

For example, with 124-8 you might cross of the 2 and carry 1 to the 8 to make 14-8. You are removing a value of 10, or ten 1's from the 20.

If you were to subtract 1100-0010 or 12-2.  you would need to carry the 1 as well. However, when working in a power of two, you will not just be carrying the 1 you will be carrying two 1's which properly abstracts the full iteration of a digit in a 2's number system. This would result in 1010  or $10$.

Floating point numbers are the most interesting, they take up 32bits in single precision and can reach a surprising level of accuracy. This is done by encoding a number in scientific notation into the binary number system.

With only 32 bits, the number can only be so accurately represented. In cases where you cannot fit the number into 32bits, rounding occurs just like we do when reducing numbers in algebra.

Optimizing for Multiplication and Division

Multiplication and Division in binary can take several operations. Consider $a*b$,  in order to multiply these numbers you would have to add $a$ b times. The more you multiply the more expensive it would be.

Thankfully, we can exploit a property of the binary numbering system, and reduce the required amount of operations using bitshift operators.

Lets keep this simple by using unsigned binary integers 4bits in length. Consider $42$, or $01000010$. We know the result would be eight, or 1000. Which traditionally would be:

Four time two (0100 + 0100) in binary

However, we could look at the multiplication as $x2^k$ where $k$ is the number factor of 2 for which you would like to multiply. Multiplying by 2 in a binary number system is the same as $2^1$ so we would want to bitshift <<1 in order to multiply by 2. 0100<<1 would become 1000 effectively multiplying by two in one operation.

If you wish to multiply by a factor that is not a multiple of two you would have to use multiple bitshift operations to reach the desired result.

For example $4*17$, we know this is $68$, on a computer this would take 17 addition operations. Using bitwise operations this can be made significantly faster:

0100 << 4 == 1000000 == 64 //4*(2^4)
1000000 + 0100 == 1000100 == 68

Here we performed the multiplication in only two operations! If you wanted to divide, you would use the inverse the same logic, by bitshifting to the right or subtracting. Bitshift operations make multiplication and division significantly faster.

Assembly in 1 minute

Assembly language is the closest we can get to working directly with a processor. The Intel assembly reference consists of hundreds of commands that can be used however many a very similar to the following basic commands:

  • mov  a, b: moves data from register a into register b
  • sub a, b: subtracts the value in a from the value in b
  • cmp a, b: performs a comparison between a and b
  • add a, b: adds value a to value b
  • lea -0x18(%a), %b: load effective address, computes the sum of -0x18 and a and stores it in b
  • je a: jump to address a if the last operation triggered the equal flag.

It is important to understand the cmp command and how it functions. When you compare two registers you actually perform a subtraction. You will subtract a from b and then an xor operation is run on the sign bit to determine if the result was negative or not. If the result is negative we know that a was greater than b and the appropriate condition code is triggered when the data exits the ALU.

Stack Overflows

Consider the following code inside so.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

int len(char *s) {
  return strlen(s);
}

void lptoa(char *s, long *p) {
  long val = *p;
  sprintf(s,"%ld",val); // string print
}

long longlen(long x) {
  long v;
  char buf[8];
  v = x;
  lptoa(buf,&v);
  return len(buf);
}

int main() {
  longlen(INT_MAX-1);
}

If dump the assembly with objdump -d so.c > so-asm.txt. We can interpret the longlen function below:

0000000000401188 <longlen>:
  401188:   55             push   %rbp
  401189:   48 89 e5       mov    %rsp,%rbp
  40118c:   48 83 ec 20    sub    $0x20,%rsp # stack of 20bytes 
  401190:   48 89 7d e8    mov    %rdi,-0x18(%rbp) #2 bytes for x
  401194:   48 8b 45 e8    mov    -0x18(%rbp),%rax #move stack ptr
  401198:   48 89 45 f8    mov    %rax,-0x8(%rbp) #8 bytes for buf[8]
  40119c:   48 8d 55 f8    lea    -0x8(%rbp),%rdx #move stack ptr
  4011a0:   48 8d 45 f0    lea    -0x10(%rbp),%rax # stack is filled
  4011a4:   48 89 d6       mov    %rdx,%rsi
  4011a7:   48 89 c7       mov    %rax,%rdi # 10 bytes copied to 2 byte space
  4011aa:   e8 a1 ff ff ff callq  401150 <lptoa>
  4011af:   48 8d 45 f0    lea    -0x10(%rbp),%rax
  4011b3:   48 89 c7       mov    %rax,%rdi
  4011b6:   e8 7b ff ff ff callq  401136 <len>
  4011bb:   48 98          cltq   
  4011bd:   c9             leaveq 
  4011be:   c3             retq     

The gcc compiler will detect the stack overflow and use a stack protector. If we recompile and dump the program again the same function will may look something like this:

0000000000401198 <longlen>:
  401198:   55                   push   %rbp
  401199:   48 89 e5             mov    %rsp,%rbp
  40119c:   48 83 ec 30          sub    $0x30,%rsp #10 more bytes allocated
  4011a0:   48 89 7d d8          mov    %rdi,-0x28(%rbp)#2bytes for `v`
  4011a4:   64 48 8b 04 25 28 00 mov    %fs:0x28,%rax #create canary 
  4011ab:   00 00 
  4011ad:   48 89 45 f8          mov    %rax,-0x8(%rbp) 
  4011b1:   31 c0                xor    %eax,%eax
  4011b3:   48 8b 45 d8          mov    -0x28(%rbp),%rax
  4011b7:   48 89 45 e8          mov    %rax,-0x18(%rbp)
  4011bb:   48 8d 55 e8          lea    -0x18(%rbp),%rdx
  4011bf:   48 8d 45 f0          lea    -0x10(%rbp),%rax
  4011c3:   48 89 d6             mov    %rdx,%rsi
  4011c6:   48 89 c7             mov    %rax,%rdi
  4011c9:   e8 92 ff ff ff       callq  401160 <lptoa>
  4011ce:   48 8d 45 f0          lea    -0x10(%rbp),%rax
  4011d2:   48 89 c7             mov    %rax,%rdi
  4011d5:   e8 6c ff ff ff       callq  401146 <len>
  4011da:   48 98                cltq   
  4011dc:   48 8b 4d f8          mov    -0x8(%rbp),%rcx
  4011e0:   64 48 33 0c 25 28 00 xor    %fs:0x28,%rcx # check canary
  4011e7:   00 00 
  4011e9:   74 05                je     4011f0 <longlen+0x58>
  4011eb:   e8 50 fe ff ff       callq  401040 <__stack_chk_fail@plt>
  4011f0:   c9                   leaveq 
  4011f1:   c3                   retq

From what I understand the canary works by alloating space on the stack immediately after the function that risks overflowing. A copy of the value at this address is stored elsewhere in a 'canary' register.
When the function finishes running, an xor operation is used, to compare the difference between the canary register and the value immediately after the stack pointer.

Conclusion

Processors are extremely limited in their scope, more of the magic we see on screen is intense maths done millions of times a second. Intel processors can only compare two values in an iteration and therefore must copy back the data to memory to reference later.

It is important to understand how your code translates to machine code at compile time. Interpreting the Assembly output of your code can help you gain intuition as to how your code actually runs. This opens up opportunities to optimizing your code to run faster. You especially should use bitwise operations liberally and understand how they work as they can make operations much more efficient.

Because a processor can only do one operation at a time it will try to pipeline many tasks into that existing clock cycle. In practice, this can be loading a value into the ALU in the same cycle that the previous result is sent back to a general purpose register for future use.