IR & CFG

Compiler and Static Analyzer

Compiler Structure

How does Compiler work?

传统编译器架构:

  • Frontend:前端

    • Lexical Analysis:词法分析

    • Syntax Analysis:语法分析

    • Semantic Analysis:语义分析

    • Translator:生成中间代码(Intermediate Representation)

  • Optimizer:优化器

  • Backend:后端

    Code Generator生成机器码

AST vs IR

why IR is better for static analysis than AST?

AST:

  • high-level and closed to grammar structure

  • usually language dependent

  • suitable for fast type checking

  • lack of control flow information

IR:

  • low-level and closed to machine code

  • usually language independent

  • compact and uniform

  • contains control flow information

  • usually considered as the basis for static analysis

3-Address Code

3-Address Code(3AC)

  • There is at most one operator on the right side of an instruction

  • Each 3AC contains at most 3 address

note:There is no fixed realization of the 3AC.

Address can means:

  • Name

  • Constant

  • Compiler-generated temporary variable

Some Common 3AC Forms:

Soot is one of the most popular static analysis framework for Java

Soot's IR is Jimple:typed 3-address code

JVM complement:

invoke special:call constructor、superclass methods、private methods

invoke virtual:instance methods call(virtual dispatch)

invoke interface:checking interface implement

invoke static:call static methods

invoke dynamic:dynamic language runs on JVM

methods signature: <class name: return type method name(param1 type, param2 type,...)>

Control Flow Graph

  • Building Control Flow Graph(CFG)

  • CFG serves as the basic structure for static analysis

  • The node in CFG can be an individual 3AC(or a Basic Block 【BB】)

Basic Blocks are maximal sequences of consecutive three-address instructions

  • It can be entered only at the beginning

  • It can be exited only at the end

Build CFG through BBs

The nodes of CFG are basic blocks

  • There is an edge from block A to block B if and only if

    • There is a conditional or unconditional jump from the end of A to the beginning of B

    • B immediately follows A in the original order of instructions and A does not end in an unconditional jump

    • A is a predecessor of B and B is a successor of A

  • It is normal to replace the jumps to instruction labels by jumps to basic blocks

  • Usually we add two nodes, Entry and Exit

    • An edge from Entry to the BB containing the first instruction of IR

    • An edge to Exit from any BB containing an instruction that could be the last instruction of IR

Last updated