Unraveling Rust Design

Following topics are covered in the blog,

  • Difference between system programming and scripting language
  • Why do we need a new programming language
  • Understand the different terminology in system programming language
  • Deep-drive into the value propositions of Rust language

Introduction

This blog is focussed on bringing about a behavioral change in the readers on understanding the problems in system programming language space and the unique properties of rust language, which helps to overcome them.

The blog is successful if it manages to convince the readers why one has to use Rust when there are so many other programming languages out there solving various problems in different verticals, why is there a need for yet another programming language.

What is a system programming language?

System programming in a large scope can be referred to as a computer programming language which helps us to write system software and is typically system constrained. We can associate system software with operating systems, utility software, device drivers, compilers, and linkers.

To understand it easily lets compare it to application programming where the main aim is to produce a production application which can be consumed by the end user where the main aim of the developer is towards the implementation of the different functionalities of the application, but in system programming the aim is to build the software platform where these applications software’s can run for example operating systems, game engines are good examples of end software.

b05117_01_01

Another classic example is a kernel which is a vital part of the OS which exposes API’s to application software and acts as an interface to device drivers. System programming is hardware specific and there has been great research over years to write optimized and performance centric code for monetary benefits for the platform service providers, we can associate word processor as a cloud service as a typical service provider.

Understanding the problems in the space of system programming

We all have heard about programming language such as C, C++, Java, Python, Ruby, JS, Haskell, which are widely used by programmers all over the world to build many products and deliver IT services. But if we take a closer looker at them in terms of system programming metrics such as control and safety we can broadly classify them into this broad spectrum spectrum as shown below.

b05117_01_02

The spectrum basically shows us the thresholds of control and safety of different programming languages, we can see C/C++ & Java tend to be more cornered towards the left-hand side of the spectrum where control features are dominating and towards the right its scripting languages such as python and JS are more prominent where the safety features are more dominant.

In C/C++ we exactly know what is happening at machine level and have full control over the memory layout, but not very safety features as we have to explicitly write them, when is the last time we heard kernel/ OS level code written in languages other than C. Over the years there has been a great work on building stable system software. Its very prone to segmentation faults and buffer overrun which are very prominent errors, let’s talk about these errors in brief.

Segmentation fault is an error which every developer must have faced at least once, it is kind of error when you try to access memory which does not belong to unit/object of the code, it happens mostly when you are trying to access variables which have already freed or writing to a read-only portion of the memory. It is common across all the programming language where the code is messed up with the memory management. 

There are a lot of ways in which you may encounter a segfault say in C++ common way to get is by:

Dereference a null pointer, for example, we declared a null pointer

int *null_pointer = NULL;

The code will break with segfault if at any point of time in the code we try to assign the null pointer,

*null_pointer = 1;

Another common segfault is when you write to a portion of memory which was marked read-only for example lets declare a constant string

char *str = “Rust_cookbook”;

the compiler marks the string as a read-only, we would get an error when we illegally assign to the string which would break into default

*str = “Another_charater”;

Buffer Overrun is yet another common security risk where the program or the input is trying to write more data into a fixed memory block, which causes it data spill over to adjacent memory locations. C/C++ are prone to buffer overflow attacks as they have no built-in protection against accessing or overwriting data. On the other had we have scripting language like python and Javascript which do not comply and we pretty much know all the standard steps during the runtime, which makes it really safe. Let’s discuss runtime in detail in order to get the complete picture,

Compiler: In Action

The compiler basically has a parser which reads the source code and applies the syntax rules of the language.The first check is that if the particular syntax does not exist in the language it will throw an error and exit the complication steps, if the parser stage is successful we get an Abstract syntax tree or AST in short. AST is basically a tree-based data structure where it contains nodes. Each node contains an element of the syntax. Parser basically produces a very big tree data structure containing all the element in the source code. After AST generation we do the semantics check which checks the code flow and verifies its operations, for example, we would break the cycle if we declare a function which accepts 1 argument but it is passed 3 arguments. The last stage generates the code which converts the AST to the code in the output language, which is the final output product.

Interpreter

Say we have a python script in run time the code gets parsed, analyzed, and fed into an interpreter. In python, we have interpreter which works in exactly the same way as a compiler but without code generation, it loads the output in memory and output is executed directly.

All the languages come with few predefined libraries and functions which a large population of developers would be using in order to perform few basic operations, these standard libraries sometimes can be really complex for example best of luck implementing print to send out standard output in the terminal in python. These functions are collected and used in runtime when the code calls them, this is the base reason why we call this library as run-time libraries. A run-time is a software designed to support the execution of computer programs written in some computer language.

Rust: Introduction

Rust is a system programming language which offers no trade-off between control and safety. It is designed to deliver low-level control like in C/C++ and high-level constructs which enable safety features like in python/JS.

In short, Rust is a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.  

Rust: history

Rust language project is sponsored by Mozilla and is also a community driven project, where the majority of commits to the project comes from the community members, which makes it highly reliable to use as the project will not die even if Mozilla stops sponsoring it, which is highly unlikely as they are building state of art web tools and browser engine using rust.

Rust 1.0, the first stable release, was released on May 15, 2015. Rust won the first place for Most Loved Programming Language of 2016 in the Stack Overflow Developer Survey.

The design of the language is highly refined by the development experience of building Servo which is an experimental web browser parallel engine being developed by Mozilla Research, it aims to create highly parallel environment along with delivering high safety. Mostly browser engines are written in C/C++ which are a millions line of code, think about restructuring them to perform parallel and often C/C++ runs into memory vulnerability errors and buffer overrun issues. Other ideal example for where Rust is used Skylight which is ruby gem which runs in the rail application at the client end which records basic analytics such as time for HTTP request etc, since it runs on the client side we have memory constraints and also require low-level control without compromising safety, Rust was a ideal fit for these use cases.

Let’s discuss about different features of system programming in detail and understand Rust’s design,

What is control in system programming language?

Before deep driving few technical concepts let’s try to briefly understand how memory allocation works, We have to understand two generic terms stack and heap which are widely used for memory allocation.

The stack is the memory set residing on the RAM for a thread of execution, say when a function is called a block is reserved on the top of the stack for local variables and some data for the execution, when the function returns the block become unused and can be freed, freeing a block from the stack is nothing more than adjusting one pointer.Stack items sit on top of the one another in the order they were placed there, and you can only remove the top one, this implies it follows LIFO (Last input first output) concept.

The heap is memory set on the RAM which follows dynamic allocation. Unlike the stack, there’s no enforced pattern to the allocation and deallocation of blocks from the heap. We can allocate a block at any time and free it at any time.

There isn’t any difference between stack and heap in performance but the lifespan is a major difference, any local variable inside a function lives on the stack. It goes away when you return from the function. If you want something to live longer than the function that declared it, you must allocate it on the heap.

Let try to understand the stack working with a C++ example below,

#include <iostream>

The above command is calling the standard input-output library in C

double multipleTwo (double input) {
    double twice = input * 2.0;
    return twice;
}

The multipleTwo returns a double output and accepts a double type input.

int main (int argc, char *argv[])
{
    int age_val = 17;
    double salary_val = 234.67;
    double List_val[3] = {3.2, 1.3, 4.9};
    printf("Your salary is %.3f\n", multipleTwo(salary_val));
    return 0;
}

The main function calls a the multipleTwo and passes the salary_val value.

Lets understand how memory allocation works, we declare variables: an int, a double, and an array of three doubles. These three variables are pushed onto the stack as soon as the main() function allocates them. When the main() function exits (and the program stops) these variables are popped off of the stack. Similarly, in the function multipleTwo(), the twice variable, which is a double, is pushed to the stack as soon it is called and pops out when it goes out of scope.

Let’s see an example when the variables are stored in the heap.

int *age = malloc(sizeof(int));
*age = 30;
free(age);

Using malloc() is to allocate memory on the heap and then using free() to deallocate it, in the above code snippet we have declared a pointer which is special datatype in C which store the addresses in memory instead of storing actual values.

Consider the example C++ below,

void example(){
    vector  vector;
    …
    auto& elem = vector[0];
    …
} 

b05117_01_03-1

In the above example, we are declaring a string type vector which goes to the stack is basically comprised of data, length, and capacity. We see in the above figure that all the allocation follows an inline layout in both the stack and the heap. The data on the stack points to the heap where we have a string which is wrapping a vector which again maps to another location.

b05117_01_03-2

Another part of control in system programming is the lightweight references here we have elem variable which is just a pointer pointing to a memory, this defines the control aspect. We have deterministic destruction where we exactly know when the vector will be removed, in the example function if it goes out of scope the destructor will run and delete the vector. This is where we get the fine grain control over the memory.

In the most of the cases, we have this extra layer of interaction in the case of C/C++ we have control over this extra layer of interactions.

b05117_01_04

lets take the example of Java, In the above diagram the vector in the stack is pointing to data in the memory which again is a pointer to so other memory location and so on this memory layout comes with the language and is imposed on the developer.

All this boils down to Zero-cost abstraction which is the ability at the compile time to optimize to nothing, which means an easy to use interface which provides the best memory layout and control. Its just not the memory layout but the other aspects of it are we should know at run time how a function is being called whether it is statically or dynamically dispatched, another idea is template expansion the optimisation done for different datatype should be tailor made for example vector of type int and string should have different optimisation by the programming language.

What is safety in system programming language?

Lets start with the C++ example which we used above with little tweaks,

void example(){
    vector  vector;
    auto& elem = vector[0];
    vector.push_back(some_string);
    cout << elem;
} 

b05117_01_05-2

If you are familiar with C++ we know that when the vector exceeds the memory block assigned to it we relocate it to a new memory chunk by copying the existing data and point the vector pointer to the new memory allocation and clear the previous memory location.

B05117_01_05.3.png

The pointer elem still pointing to the previous memory location of the vector, which makes it a dangling pointer which is still pointing to a freed memory and in the cout step the program will have a memory error or segfault. Its very difficult for the developer to find these problems in a very large chunk of code.

Lets try to understand what exactly is wrong here and identify the problem statement, There two important this happening as you can see below that are aliasing and mutating. Aliasing is the case where two-pointers as pointing to the same location and Mutating is the process of referencing the memory location of the data pointer and making data changes.

b05117_01_05-4

The main problem is that the aliased vector pointer does not know any information about the mutated data, we do not face any problems when these activities are independent and we face these memory & safety bugs when we have both them happening simultaneously.

What is Garbage collection (GC)?

The garbage collector is memory management feature in modern programming languages, such as Java. Languages that use garbage collection are often interpreted or run within a virtual machine like the JVM. In each case, the environment that runs the code is also responsible for garbage collection.

C/C++ allows allocating and freeing memory is done manually by the programmer which often leads to memory error as shown above because of the dangling pointer can cause serious bugs. Programs with an automatic garbage collector (GC) try to eliminate these bugs by automatically detecting data chunks which are no longer required by the program. GC basically ensures two assets which are to clear memory which is longer used and secondly to ensure not to clear any memory which can be used by the program.

We might we wondering these problems of dangling pointer and runtime errors are to be taken care by GC. But often we do forget the GC does not have enough control and need runtime for solving these complex problems, its beneficial to solve these problems without the need of runtime. Also, GC is sufficient to solve problems such as iterator invalidation & data races (These are basically concurrent bugs).

Rust’s Design and Solutions

Most of the problems that we discussed above are some what interlinked and Rust is basically trying to solve them all. The main concept which we will focus is the concept called Ownership/Borrowing.

Ownership & Borrowing basically provides you three important points,

  • There is no runtime cost – just like C++, all the code is analyzed statically
  • Totally memory safe – These are properties of GC
  • Data-race freedom

Ownership

B05117_01_05.6.png

Lets look into this in story manner lets say Person A(orange) is an owner of a book and decides it give it to a Person B(Blue), Once the book is sent it is no longer associated with Person A as the owner of the book has changed. Since the book has a new owner we can’t delete or deallocate its resources, but Person B being the owner of the book can decide to leave the book. When a situation happens Rust identifies that the book has no owner and can be deallocated.

Earlier in the blog, we had discussed about the aliasing and mutation, with ownership concept we can totally avoid aliasing as at a particular point of time there will be only one owner of the resource. We still have a mutation which can happen freely.

Lets look into rust snippet below to understand the concept better,

fn take(vec : Vec) {
    …
    …
}
fn give() {
    …
   let mut vec = Vec::new();
   vec.push(1);
   vec.push(2);
   take(vec);
   …
}

B05117_01_06.1.png

Consider the give rust function in the above snippet which is basically defining a vector and pushing in data values 1,2 when the take function is called in the program there is basically a copy of the vector is made which enables take function to take the ownership of the vector, we can now see that there are two copies in the stack pointing to the same data

B05117_01_06.2.png

which is the case of aliasing so rust basically shallows the original pointer and complete authority to the vector is given to the take function, the data is changed accordingly in the processes of the take function, since there is no return from the take function the take values in the stack and data are deleted once take function goes out of scope.

b05117_01_06-3

lets say we try to access the vector after is moved,

fn give() {

    let mut vec = Vec::new();
    vec.push(1);
    vec.push(2);
    take(vec);
    
    //We would get an error here saying that the vector has been moved
    vec.push(3); 

}

We basically achieve free after use which is deallocation of resources after its scope is complete and enable complete movement.

Borrowing

Borrowing a process of lending out the ownership to another party for a short period of time but the lifetime or in another word the scope of the variable is still under the control of the actual owner.

There is two type of borrowing in Rust, both the cases aliasing and mutation do not happen simultaneously.

Shared Borrowing

b05117_01_07

We are forbidding mutation and allowing aliasing, here we are sharing the ownership to multiple parties.For example,

fn take(vec : &Vec) { 


}

fn give() {

   let mut vec = Vec::new();
   vec.push(1);
   vec.push(2);
   take(&vec); -> Loan out the vector

}

b05117_01_09-1

In the above case, we are basically loaning out ownership of vector from the give function to the take function. The ampersand symbol in the argument indicates that the vector is a shared reference. In runtime, a vector reference is gonna be made and once it goes out of scope it been deallocated and ownership is transferred back to the give function.

Shared references are immutable, which implies that there the borrowed element data values cannot be changed they can only be accessed in the scope of the party for example,

fn take(vec : &Vec) { 
   vec.push(3);
   vec[1] += 3;

}

Both the above steps are forbidden and we will get an error saying that the shared references are immutable. This is not 100 percent true as under control circumstances we can make changes to these shared references.

Mutable Borrowing

b05117_01_08

We are forbidding aliasing and allowing mutation, here we are lending out ownership to a single party who can go ahead and lend to another pointer but implicitly back of  all pointer and come back to the source.

Mutable references are denoted by the &mut tag. Consider the examples below to get a better idea,

fn push_all (from : &Vec, to: &mut Vec) {
    for elem in from.iter() {
        to.push(*elem);
    }
}

b05117_01_09-2

Since the to is a mutable reference in the function push_all pushing values is totally legal and complier will allow it. We have a reference elem which is a pointer pointing to the memory location of the from vector and pushing its values to a new vector, which is a mutable reference.

Consider the cases where the both argument to the push_all function are the same vec which will be the case where both aliasing and mutation is possible.

fn push_all (from : &Vec, to: &mut Vec) {..}

fn caller () {
    let mut vec = ..;
    push_all(&vec , &mut vec);
}

b05117_01_09-3

In function caller, we will have a dangling pointer situation which makes it totally illegal since the modification are made to the source memory location.

The &mut variable is the only way to access the memory it is pointed at. In these cases complier will not allow it execute will throw an error.Rust complier has the complete understanding of the shared and mutable references.

To understand above statement consider the example below,

{
    let mut vec = Vec :: new();


    
    for i in 0..vec.len(){

         let elem : &i32 = &vec[i];

         vec.push(..); // compiler will not allow this statement
    }
    vec.push(..);
}

Inside the for loop the vec is a shared reference and the scope of elem is inside the for loop and the compiler will not allow the addition of new value with the vec.push() statement inside the for loop, once the complier exits the for loop the scope of the elem is expired the mutable vec can be made changes.

Concurrency

Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn’t necessarily mean they’ll ever both be running at the same instant. Eg. multitasking on a single-core machine.

Using the above fundamental concepts of rust that we just learned. we are gonna build libraries which are gonna tackle of problems in concurrency, shared memory. Rust has few best practices which are statically imposed by the language for usage of these paradigms in order to leverage ownership and borrowing.

b05117_01_10

The main problem we are trying to solve in concurrency is Data race. Lets first understand data race, it is the case where two threads are accessing the same data and at least one of them is writing to it, in C++ this is an undefined behavior. Aliasing, mutation, and unsynchronisation is the root cause of data race, we are gonna leverage messaging to solve data race situations.

b05117_01_11

The key idea here is to pass ownership of message between thread, say a particular thread has ownership of a message and it can pass the message to another thread which passes by ownership and get back a message. Rust is enforcing isolation of thread at compile time when an ownership of the message is transferred to another thread, the original thread no longer has access to the message that has been transferred.

Let understand it better with an example,

b05117_01_12-1

fn parent(rx : Reciver) {
    spawn(…);

    let rx_mssg = rx.recv();
}

b05117_01_12-3

The above function is the parent function which is at the receiver end will spawn a new child thread to accept the transmission.

fn child() {
     let v = Vec::new();
     …
     tx.send(v);
}

b05117_01_12-4

The child thread is at the transmission end as shown in the figure above, the child thread creates a vector which is sent to the parent which is the receiving end.

The common thing that both the parent and child thread share is the channel, which is a shared memory space. The child thread created the vector which adds the data to the memory while the ownership of the vector is still is held by the child thread, but when it sends the ownership is transferred to the channel, at this point of time in runtime both the threads do not have ownership of the data, it is the channel which has the ownership of the thread. When the parent thread receives the data the ownership is transferred to the parent thread, by this paradigm at given time only one thread has ownership of the data avoiding data races.

Shared read-only access, we would like to have a data which is only read-only and is been used by various entities we have something called Arc in rust which stands for Atomically Reference Counted, its declared by

Arc<Vec> 

b05117_01_13

Arc has ownership of the data, it can be only shared for which we use shared reference which makes mutation impossible which implies that threads cannot edit the data source. The only difference is that in Arc we have addition field in the memory layout of the data field which is names ref_count, which increments its value when the shared resource is being accessed.

Locker mutable access, now we want mutable access of variables. Rust has mutex which provides extra synchronization to prevent threads running simultaneously on the data leading to data races, let’s take a look at the example below

fn push (mutex : &Mutex<Vec>) {
    let mut guard = mutex.lock();
    guard.push(..);
}

In the above snippet, we have mutex vector which has a similar layout of the Arc, mutex holds ownership of the data in the memory, with mutex.lock() Rust provides a shared pointer to the data which is mutable. But it follows deterministic destruction which is the at the end scope of the push function.

Rust mutex explicitly provides mutable access by providing ownership only after the lock statement. Within with the mutex lock span, we can shared reference to the data but after the scope of the mutex lock, the ownership will be revoked in order to prevent data-race.

Thread safety checking in rust this feature allows to check if a primitive is thread safe or not, for example

fn transfer (t : T){..}

send is the checkpoint which allows whether the primitive can be sent or not, Arc<Vec> is a sendable primitive whereas Rc<Vec> is not.

Conclusion

The above are few features of Rust which make it a great system programming language which enables high-level features of python, JS along with low-level features of C/C++.

Rust gives stronger safety guarantees that garbage collector such as deterministic destruction, data-race freedom, iterator invalidation which enables the developer to program confidently without these bugs at compile time.

Check out my book named Rust CookBook by Packt publication which comes with a lot of application-specific recipes, which will help you get kick-started with developing real-world high-performance applications with the Rust programming language and integrate rust units into your existing applications

B05117

Reference

Advertisements

2 thoughts on “Unraveling Rust Design

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s