



# **Compact Code Generation**







Processor Automated Synthesis by iTerative Analysis







# Integrated Instruction Selection and Register Allocation for Compact Code Generation Exploiting Freeform Mixing of 16- and 32-bit Instructions

Tobias Edler von Koch Igor Böhm and Björn Franke





Processor Automated Synthesis by iTerative Analysis







128-Mbit Flash 27.3mm<sup>2</sup> at 0.13µm







128-Mbit Flash 27.3mm<sup>2</sup> at 0.13µm



ARM Cortex M3 0.43mm<sup>2</sup> at 0.13µm







128-Mbit Flash 27.3mm<sup>2</sup> at 0.13µm



ARM Cortex M3 0.43mm<sup>2</sup> at 0.13µm



ENCORE Calton 0.15mm<sup>2</sup> at 0.13µm







128-Mbit Flash 27.3mm<sup>2</sup> at 0.13µm



ARM Cortex M3 0.43mm<sup>2</sup> at 0.13µm



ENCORE Calton 0.15mm<sup>2</sup> at 0.13µm

#### **RISC Architectures**

sacrifice code density in order to simplify implementation circuitry and decrease die area







### Solution to Code Size Problem

 Dual instruction sets providing 32-bit and 16-bit instruction encodings:





## Solution to Code Size Problem

 Dual instruction sets providing 32-bit and 16-bit instruction encodings:

ARM Thumb-2
ARM Thumb

microMIPS ARCompact





microMIPS

## Solution to Code Size Problem

 Dual instruction sets providing 32-bit and 16-bit instruction encodings:

# ARM Thumb-2 ARM Thumb

# **ARC**ompact

There's no such thing as a free lunch!





microMIPS

## Solution to Code Size Problem

 Dual instruction sets providing 32-bit and 16-bit instruction encodings:

# ARM Thumb-2 ARM Thumb

# ARCompact

- There's no such thing as a free lunch!
- 16-bit instructions come with constraints!





Only subset of registers accessible





- Only subset of registers accessible
- Explicit instructions necessary to switch between 16-bit and 32-bit modes





- Only subset of registers accessible
- Explicit instructions necessary to switch between 16-bit and 32-bit modes
- Reduced size of immediate operands







- Only subset of registers accessible
- Explicit instructions necessary to switch between 16-bit and 32-bit modes
- Reduced size of immediate operands
- Not every 32-bit instruction has a 16-bit counterpart





- Only subset of registers accessible
- Explicit instructions necessary to switch between 16-bit and 32-bit modes
- Reduced size of immediate operands
- Not every 32-bit instruction has a 16-bit counterpart
- Free mixing of 16- and 32-bit encodings not always possible





# **ARCompact** 16-bit Instruction Format Constraints

- Only subset of registers accessible
- Explicit instructions necessary to switch between 16-bit and 32-bit modes
- Reduced size of immediate operands
- Not every 32-bit instruction has a 16-bit counterpart
- Free mixing of 16- and 32-bit encodings not always possible





32-Bit Only

```
ld r2,[sp,0]
ld r3,[sp,4]
```

```
ld r4,[sp,8]
add r2,r2,r3
asl r2,r2,2
sub r2,r2,r4
```

24 bytes







#### 32-Bit Only

Mixed Mode Aggressive

ld r2,[sp,0]
ld r3,[sp,4]

ld**\_s** r2,[sp,0] ld**\_s** r3,[sp,4]

ld r4,[sp,8]
add r2,r2,r3
asl r2,r2,2
sub r2,r2,r4

ld\_s rX,[sp,8]
add\_s r2,r2,r3
asl\_s r2,r2,2
sub\_s r2,r2,rX

24 bytes

12 bytes

Instruction Selection







32-Bit Only

Mixed Mode Aggressive

ld r2,[sp,0] ld r3,[sp,4]

ld r4,[sp,8]
add r2,r2,r3
asl r2,r2,2
sub r2,r2,r4

ld\_s r2,[sp,0]
ld\_s r3,[sp,4]
mov r4,r1
ld\_s r1,[sp,8]
add\_s r2,r2,r3
asl\_s r2,r2,2
sub\_s r2,r2,r1
mov r4,r1

24 bytes

20 bytes

Instruction Selection Register Allocation







32-Bit Only

Mixed Mode Aggressive

Mixed Mode Integrated

ld r2,[sp,0] ld r3,[sp,4]

ld r4, [sp, 8] add r2,r2,r3asl r2, r2, 2 sub r2,r2,r4

ld s r2,[sp,0] ld **s** r3,[sp,4] mov r4,r1 ld **s r1**,[sp,8] add  $\mathbf{s}$  r2,r2,r3 asl **s** r2,r2,2 sub s r2, r2, r1 mov r4,r1

ld s r2,[sp,0] ld s r3,[sp,4]

ld r4,[sp,8] add s r2, r2, r3 asl **s** r2,r2,2 sub r2,r2,r4

24 bytes

20 bytes

6 bytes







# Efficient Compact Code Generation is an Integrated Instruction Selection and Register Allocation Problem!





ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.







ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.

Compiler backend supports **two** compact code generation strategies

**Opportunistic** 







ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.

Compiler backend supports **two** compact code generation strategies







ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.

Compiler backend supports **two** compact code generation strategies







ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.

Compiler backend supports **two** compact code generation strategies







ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.

Compiler backend supports **two** compact code generation strategies

**Opportunistic** MIR ↓ Match / Cover 32-bit patterns only LIR 1 **Register Allocation** LIR, **Code Emission** 32-bit/16-bit instructions ASM,





ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.









ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.









ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.









ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.









ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.









ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.









ECC - EnCore C Compiler based on commercial CoSy compiler by ACE©.









#### Feedback-Guided Instruction Selection

vX ... Virtual Register rX ... Physical Register

#### **MIR**







#### Feedback-Guided Instruction Selection



aggressively select 16-bit instructions







register allocator constrained to 16-bit accessible register set

aggressively select 16-bit instructions

























**Feedback Guided Code Generation** 

instructions based on feedback

fewer constraints for register allocator









**Feedback Guided Code Generation** 

fewer constraints for register allocator







# Evaluation - Experimental Setup

| BenchMarks | EEMBC I.I |
|------------|-----------|
|------------|-----------|

| Core                | ARC750D              |
|---------------------|----------------------|
| Pipeline            | 7-stage interlocked  |
| Execution Order     | In-Order             |
| Branch Prediction   | Yes                  |
| ISA                 | ARCompact            |
| Floating Point      | Hardware             |
| Memory<br>Subsystem |                      |
| LI Cache            | Yes                  |
| Instruction         | 8k/2-way associative |
| Data                | 8k/2-way associative |
| L2 Cache            | No                   |



| Simulator          | ArcSim                                        |
|--------------------|-----------------------------------------------|
| Simulation Mode    | Full System, Cycle Accurate                   |
| Accuracy           | Cycle accurate mode validated against real HW |
| Options            | Default                                       |
| I/O & System Calls | Emulated                                      |







GCC GCC

(avg: 2.7%)

### Evaluation - Code Size Reduction



Baseline: plain32-bit code. Feedback-guided selection Opportunistic selection (avg: 16.7%) Opportunistic selection (avg: 15.4%)







































# informatics



# Why does the simple Opportunistic Mode perform so well?







# Why does the simple Opportunistic Mode perform so well?







# Why does the simple Opportunistic Mode perform so well?

register allocator selects registers with lower
 ID from set of possible registers







# Why does the simple Opportunistic Mode perform so well?

- register allocator selects registers with lower
   ID from set of possible registers
- calling conventions constrain register allocator

















































# Conclusions

 Compact Code Generation is an integrated Instruction Selection and Register Allocation problem.





# Conclusions

- Compact Code Generation is an integrated Instruction Selection and Register Allocation problem.
- While our simple opportunistic mode works well, our feedback-directed mode produces more consistent results and does not rely on calling conventions or register-allocation implementations.





# Conclusions

- Compact Code Generation is an integrated Instruction Selection and Register Allocation problem.
- While our simple opportunistic mode works well, our feedback-directed mode produces more consistent results and does not rely on calling conventions or register-allocation implementations.
- Our scheme is the first one demonstrating that small code size can be achieved whilst improving performance.





#### http://groups.inf.ed.ac.uk/pasta/

#### Thanks!

For more information visit our website or search for the term 'PASTA project'.





University Homepage School Homepage School Contacts School Search



Processor Automated Synthesis by iTerative Analysis Project

In the PASTA project we seek to automate the design and optimisation of customisable embedded processors. We do this by creating tools that are able to learn about the physical characteristics of the underlying silicon technology and use that knowledge to synthesise the structure of an embedded processor. As the processor is now a flexible entity without a pre-defined instruction set, the compiler for that processor must also be synthesised. Furthermore, the code optimisations that the compiler performs when translating from source code to the synthetic architecture, must also be synthesised.



The three main areas for automated synthesis are:

- The processor architecture,
- the micro-architecture, and
- the compiler.



However, the information on which to make automated decisions in each case will be different. At the micro-architecture level we need to know how each microarchitecture option translates into speed, energy and die area. At the architectural level we need to know how each instruction set option translates into clock cycles of execution time, and at the compiler level we need to know how each optimisation reduces the overall number of instructions executed and maximises the effectiveness of the memory system.

The challenge of our research is that all three areas are inter-dependent and ultimately depend on the characteristics of the silicon on which the system is based. By blurring the boundary between hardware and software, and by automating

the process of adjusting that boundary, we hope to create a system that can perform design trade-offs in seconds when currently it takes an experienced designer several days.

Q - Search

News

Publications

People Seminars

Contact Internal Area

#### **PASTA Activities**

#### EnCore Tools

ECC Compiler GCC Compiler EnCore Simulator Verification/Co-Simulation

#### **HW Systems**

#### EnCore Processor

EnCore Calton

FPGA Platform ASIC Flow Chips & Boards

#### M.Sc. and UG Projects

#### Research Areas

Resource Sharing Configurable Flow Accelerators Automated ISE Mapping ISE

Power Prediction & Optimisation Interconnect Synthesis

Cache Optimisation for Embedd Systems

Banks

Fast Cycle-Approximate

