# IR & CFG

## Compiler and Static Analyzer

### Compiler Structure

How does Compiler work?

![image-20230402150937216](https://1239337109-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F2HLPOkuOfb7iyCzDJ8vA%2Fuploads%2Fgit-blob-ab94d3a4251fd41cf2cd3ed6b1ef505a16edd704%2Fimage-20230402150937216.png?alt=media)

传统编译器架构：

![image-20230402151352286](https://1239337109-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F2HLPOkuOfb7iyCzDJ8vA%2Fuploads%2Fgit-blob-a8cddfcc54ef2429352d03bf0b0e95ffd8c7604a%2Fimage-20230402151352286.png?alt=media)

* Frontend：前端
  * Lexical Analysis：词法分析
  * Syntax Analysis：语法分析
  * Semantic Analysis：语义分析
  * Translator：生成中间代码（Intermediate Representation）
* Optimizer：优化器
* Backend：后端

  Code Generator生成机器码

### why IR

why IR is better for static analysis than AST？

![image-20240312193742675](https://1239337109-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F2HLPOkuOfb7iyCzDJ8vA%2Fuploads%2Fgit-blob-de49136f3ed3a10fc3e64aa806cafd5346f22a96%2Fimage-20240312193742675.png?alt=media)

AST：

* high-level and closed to grammar structure
* usually language dependent
* suitable for fast type checking
* lack of control flow information

IR：

* low-level and closed to machine code
* usually language independent
* compact and uniform
* contains control flow information
* usually considered as the basis for static analysis

Other candidates：

Java source code：

* statements and classes can be nested

Java bytecode

* advantages
  * no nesting; one statement follows the other; looping/branches through jumps
  * nested classes are "flattened" into normal classes
* disadvantages
  * no local variables: operations performed on operand stack
  * too many bytecodes（more than 200，many of them are overloaded based on their type）

## 3-Address Code

3-Address Code（3AC）

* There is at most one operator on the right side of an instruction
* Each 3AC contains at most 3 address

note：There is no fixed realization of the 3AC.

Address can means：

* Name
* Constant
* Compiler-generated temporary variable

Some Common 3AC Forms：

![image-20240312195128468](https://1239337109-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F2HLPOkuOfb7iyCzDJ8vA%2Fuploads%2Fgit-blob-f22de3c4a2bb44085f905e25a38972c763a25305%2Fimage-20240312195128468.png?alt=media)

> Soot is one of the most popular static analysis framework for Java
>
> Soot's IR is Jimple：typed 3-address code

> JVM complement：
>
> invoke special：call constructor、superclass methods、private methods
>
> invoke virtual：instance methods call（virtual dispatch）
>
> invoke interface：checking interface implement
>
> invoke static：call static methods
>
> invoke dynamic：dynamic language runs on JVM
>
> methods signature： \<class name: return type method name(param1 type, param2 type,...)>

## Control Flow Graph

* Building Control Flow Graph（CFG）
* CFG serves as the basic structure for static analysis
* The node in CFG can be an individual 3AC（or a Basic Block 【BB】）

Basic Blocks are maximal sequences of consecutive three-address instructions

* It can be entered only at the beginning
* It can be exited only at the end
* ![image-20240312203058036](https://1239337109-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F2HLPOkuOfb7iyCzDJ8vA%2Fuploads%2Fgit-blob-a31851a49953d71e0d42225ed0a76610a544a635%2Fimage-20240312203058036.png?alt=media)

![image-20230403093852162](https://1239337109-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F2HLPOkuOfb7iyCzDJ8vA%2Fuploads%2Fgit-blob-51e870af08f8ba9482619ade0a2fcdb82bec4058%2Fimage-20230403093852162.png?alt=media)

![image-20230403094033043](https://1239337109-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F2HLPOkuOfb7iyCzDJ8vA%2Fuploads%2Fgit-blob-fd2e8909e80aa793bce9e2db31899a76f8ea1377%2Fimage-20230403094033043.png?alt=media)

Build CFG through BBs

The nodes of CFG are basic blocks

* There is an edge from block A to block B if and only if
  * There is a conditional or unconditional jump from the end of A to the beginning of B
  * B immediately follows A in the original order of instructions and A does not end in an unconditional jump
  * A is a predecessor of B and B is a successor of A
* It is normal to replace the jumps to instruction labels by jumps to basic blocks
* Usually we add two nodes, Entry and Exit
  * An edge from Entry to the BB containing the first instruction of IR
  * An edge to Exit from any BB containing an instruction that could be the last instruction of IR

![image-20230403094332792](https://1239337109-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F2HLPOkuOfb7iyCzDJ8vA%2Fuploads%2Fgit-blob-0e8034e3ae37d1ad457371a0699480bd76c8a5e0%2Fimage-20230403094332792.png?alt=media)
