IR & CFG

Compiler and Static Analyzer

Compiler Structure

How does Compiler work?

传统编译器架构：

Frontend：前端
- Lexical Analysis：词法分析
- Syntax Analysis：语法分析
- Semantic Analysis：语义分析
- Translator：生成中间代码（Intermediate Representation）
Optimizer：优化器
Backend：后端
Code Generator生成机器码

why IR

why IR is better for static analysis than AST？

AST：

high-level and closed to grammar structure
usually language dependent
suitable for fast type checking
lack of control flow information

IR：

low-level and closed to machine code
usually language independent
compact and uniform
contains control flow information
usually considered as the basis for static analysis

Other candidates：

Java source code：

statements and classes can be nested

Java bytecode

advantages
- no nesting; one statement follows the other; looping/branches through jumps
- nested classes are "flattened" into normal classes
disadvantages
- no local variables: operations performed on operand stack
- too many bytecodes（more than 200，many of them are overloaded based on their type）

3-Address Code

3-Address Code（3AC）

There is at most one operator on the right side of an instruction
Each 3AC contains at most 3 address

note：There is no fixed realization of the 3AC.

Address can means：

Name
Constant
Compiler-generated temporary variable

Some Common 3AC Forms：

Soot is one of the most popular static analysis framework for Java
Soot's IR is Jimple：typed 3-address code

JVM complement：
invoke special：call constructor、superclass methods、private methods
invoke virtual：instance methods call（virtual dispatch）
invoke interface：checking interface implement
invoke static：call static methods
invoke dynamic：dynamic language runs on JVM
methods signature： <class name: return type method name(param1 type, param2 type,...)>

Control Flow Graph

Building Control Flow Graph（CFG）
CFG serves as the basic structure for static analysis
The node in CFG can be an individual 3AC（or a Basic Block 【BB】）

Basic Blocks are maximal sequences of consecutive three-address instructions

It can be entered only at the beginning
It can be exited only at the end

Build CFG through BBs

The nodes of CFG are basic blocks

There is an edge from block A to block B if and only if
- There is a conditional or unconditional jump from the end of A to the beginning of B
- B immediately follows A in the original order of instructions and A does not end in an unconditional jump
- A is a predecessor of B and B is a successor of A
It is normal to replace the jumps to instruction labels by jumps to basic blocks
Usually we add two nodes, Entry and Exit
- An edge from Entry to the BB containing the first instruction of IR
- An edge to Exit from any BB containing an instruction that could be the last instruction of IR

PreviousIntro NextDFA

Last updated 9 months ago

Was this helpful?