Lab 4: Source Code Scanner for (a subset of) the C Language

Submission Due: the beginning of your lab session two weeks from now

In this fourth laboratory, you will construct a source code scanner for a C-like language.

Lab Materials

  1. Lab Source Files

Lab Assignment

Your scanner must be generated by the Lex lexical analyzer generator. For this laboratory you must recognize and print out the tokens listed below. You can check your progress by downloading the example input file and its associated output file.

The following is a list of C tokens that you must recognize. If the action is a word in all capitals, print the word and then return a token constant with the same name. If the action is a character, print that character and then return it. If the action is "ignored", do nothing. The tables below describe the syntax that should be accepted for each token.
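The print-and-return convention described above can be sketched in plain C as two small helpers that a Lex action could call. This is only an illustration: the helper names emit_word and emit_char, the starting value 256 for the token constants, and the one-token-per-line print format are all assumptions rather than part of the lab materials. The provided example output file determines the real format.

```c
#include <stdio.h>

/* Hypothetical token constants. Starting at 256 keeps them distinct from
   single-character tokens, which are returned as their own character code. */
enum token { PREPROC = 256, ID, INTVAL };

/* Action is a word in all capitals: print the word, return its constant. */
int emit_word(const char *name, int token)
{
    printf("%s\n", name);
    return token;
}

/* Action is a character: print that character, return it. */
int emit_char(int c)
{
    printf("%c\n", c);
    return c;
}
```

Inside a Lex rule, a helper like this might be called as `return emit_char(yytext[0]);`, where yytext is the buffer in which Lex stores the matched text.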

C Comments:

  Action    Syntax
  ignored   /*...*/
  ignored   //...
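To make the two comment forms concrete, here is a minimal hand-written C sketch that measures how many characters a comment occupies. It is not the Lex solution itself (in Lex the same thing is normally expressed as a pattern or a start condition). The function name comment_length and the choice to consume the rest of the input when a block comment is never closed are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Returns the number of characters occupied by a comment starting at s,
   or 0 if s does not start with a comment. */
size_t comment_length(const char *s)
{
    if (s[0] == '/' && s[1] == '*') {
        /* Block comment: ends at the first closing delimiter (no nesting).
           The search starts after the opener, so "/" + "*" + "/" alone
           is not treated as a complete comment. */
        const char *end = strstr(s + 2, "*/");
        if (end == NULL)
            return strlen(s);            /* assumption: unterminated, consume the rest */
        return (size_t)(end - s) + 2;    /* include the closing delimiter */
    }
    if (s[0] == '/' && s[1] == '/') {
        /* Line comment: runs up to, but not including, the newline. */
        return strcspn(s, "\n");
    }
    return 0;
}
```

Note that a lone `/` returns 0 here, so it remains available to be scanned as the DIV operator.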

C Preprocessor Directives:

  Action    Syntax
  PREPROC   # ...

C Punctuation:

  Action    Syntax
  (         (
  )         )
  {         {
  }         }
  [         [
  ]         ]
  ,         ,
  ;         ;

C Identifiers:

  Action    Syntax
  ID        A letter followed by any number of letters, digits, and underscores
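The identifier rule above corresponds to a pattern along the lines of `[A-Za-z][A-Za-z0-9_]*`. As a hedged illustration of what that pattern accepts, the C sketch below computes the length of the identifier at the start of a string. The function name identifier_length is hypothetical. Standard C also permits a leading underscore, but the rule in this lab does not, and the sketch follows the lab.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the identifier at the start of s, or 0 if s does
   not begin with a letter. */
size_t identifier_length(const char *s)
{
    size_t n;

    if (!isalpha((unsigned char)s[0]))
        return 0;

    /* After the first letter, accept letters, digits, and underscores. */
    n = 1;
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    return n;
}
```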

C Literal Values:

  Action    Syntax
  INTVAL    Decimal, octal, or hexadecimal
  FLTVAL    Decimal value followed by dot followed by decimal value followed by 'f'
  DBLVAL    Decimal value followed by dot followed by decimal value
  STRVAL    "..."
  CHARVAL   '...'
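The three numeric forms overlap (every FLTVAL begins with a DBLVAL, which begins with an INTVAL), so the scanner has to take the longest match. The C sketch below illustrates that classification by hand. It is a simplified illustration under stated assumptions: the names classify_number and NOT_A_NUMBER are hypothetical, a hexadecimal literal is assumed to start with 0x or 0X, and octal literals are not validated (a leading 0 followed by 8 or 9 is still accepted as INTVAL).

```c
#include <ctype.h>
#include <stddef.h>

enum literal { NOT_A_NUMBER, INTVAL, FLTVAL, DBLVAL };

/* Classifies the numeric literal at the start of s and stores its length
   in *len. Takes the longest match, as a Lex-generated scanner would. */
enum literal classify_number(const char *s, size_t *len)
{
    size_t i = 0;

    *len = 0;
    if (!isdigit((unsigned char)s[0]))
        return NOT_A_NUMBER;

    /* Hexadecimal: 0x or 0X followed by at least one hex digit. */
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X') &&
        isxdigit((unsigned char)s[2])) {
        i = 2;
        while (isxdigit((unsigned char)s[i]))
            i++;
        *len = i;
        return INTVAL;
    }

    /* Decimal or octal digits (octal digits are not checked here). */
    while (isdigit((unsigned char)s[i]))
        i++;

    /* Digits, a dot, then digits form a DBLVAL; a trailing f makes it a FLTVAL. */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;
        while (isdigit((unsigned char)s[i]))
            i++;
        if (s[i] == 'f') {
            *len = i + 1;
            return FLTVAL;
        }
        *len = i;
        return DBLVAL;
    }

    *len = i;
    return INTVAL;
}
```

In a Lex specification the same effect comes for free: with one pattern per literal kind, Lex selects whichever pattern matches the most input.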

C Primitive Types:

  Action    Syntax
  VOID      void
  CHAR      char
  SHORT     short
  INT       int
  LONG      long
  FLOAT     float
  DOUBLE    double

C Operators:

  Action    Syntax
  EQ        ==
  NE        !=
  GE        >=
  LE        <=
  GT        >
  LT        <
  ADD       +
  SUB       -
  MUL       *
  DIV       /
  MOD       %
  OR        ||
  AND       &&
  BITOR     |
  BITAND    &
  BITXOR    ^
  NOT       !
  COM       ~
  LSH       <<
  RSH       >>
  SET       =
  SETADD    +=
  SETSUB    -=
  SETMUL    *=
  SETDIV    /=
  SETMOD    %=
  SETOR     |=
  SETAND    &=
  SETXOR    ^=
  SETLSH    <<=
  SETRSH    >>=
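Many of these operators are prefixes of one another (`<`, `<<`, `<<=`), so the scanner must prefer the longest one that matches. The C sketch below shows that idea with a table ordered longest first. The struct and function names are hypothetical, and a Lex-generated scanner does not need such a table, because Lex already selects the longest match on its own.

```c
#include <stddef.h>
#include <string.h>

struct op { const char *text; const char *name; };

/* Listed longest first so that the first prefix match is also the longest. */
static const struct op ops[] = {
    {"<<=", "SETLSH"}, {">>=", "SETRSH"},
    {"==", "EQ"}, {"!=", "NE"}, {">=", "GE"}, {"<=", "LE"},
    {"||", "OR"}, {"&&", "AND"}, {"<<", "LSH"}, {">>", "RSH"},
    {"+=", "SETADD"}, {"-=", "SETSUB"}, {"*=", "SETMUL"}, {"/=", "SETDIV"},
    {"%=", "SETMOD"}, {"|=", "SETOR"}, {"&=", "SETAND"}, {"^=", "SETXOR"},
    {">", "GT"}, {"<", "LT"}, {"+", "ADD"}, {"-", "SUB"}, {"*", "MUL"},
    {"/", "DIV"}, {"%", "MOD"}, {"|", "BITOR"}, {"&", "BITAND"},
    {"^", "BITXOR"}, {"!", "NOT"}, {"~", "COM"}, {"=", "SET"},
};

/* Returns the action name of the longest operator that prefixes s,
   or NULL if s does not start with an operator. */
const char *match_operator(const char *s)
{
    for (size_t i = 0; i < sizeof ops / sizeof ops[0]; i++) {
        if (strncmp(s, ops[i].text, strlen(ops[i].text)) == 0)
            return ops[i].name;
    }
    return NULL;
}
```

In Lex, rule order only matters when two patterns match the same number of characters, in which case the rule listed first wins.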

C Control Flow:

  Action    Syntax
  RETURN    return
  DO        do
  WHILE     while
  FOR       for
  SWITCH    switch
  CASE      case
  DEFAULT   default
  IF        if
  ELSE      else
  CONTINUE  continue
  BREAK     break
  GOTO      goto

C Keywords:

  Action    Syntax
  UNSIGNED  unsigned
  TYPEDEF   typedef
  STRUCT    struct
  UNION     union
  CONST     const
  STATIC    static
  EXTERN    extern
  AUTO      auto
  REGISTER  register
  SIZEOF    sizeof
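Every reserved word in the primitive type, control flow, and keyword tables also fits the identifier rule, so the scanner needs a way to tell them apart. One common approach, sketched below in C, is to scan a whole word first and then look it up in a table of reserved words. The names reserved and is_reserved are hypothetical, and this is only one possible design. In Lex itself the usual alternative is to list the keyword rules before the ID rule: both match the same number of characters, and Lex prefers the earlier rule.

```c
#include <stddef.h>
#include <string.h>

/* Reserved words from the tables above; each action is the same word in capitals. */
static const char *const reserved[] = {
    "void", "char", "short", "int", "long", "float", "double",
    "return", "do", "while", "for", "switch", "case", "default",
    "if", "else", "continue", "break", "goto",
    "unsigned", "typedef", "struct", "union", "const",
    "static", "extern", "auto", "register", "sizeof",
};

/* Returns 1 if the len-character word at s is reserved, 0 if it should be
   reported as an ID. Comparing the length as well keeps "integer" or "iffy"
   from being mistaken for "int" or "if". */
int is_reserved(const char *s, size_t len)
{
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++) {
        if (strlen(reserved[i]) == len && strncmp(s, reserved[i], len) == 0)
            return 1;
    }
    return 0;
}
```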

When you demonstrate your solution for this lab, you will use both the provided test input and an input that you will not have access to ahead of time. Be sure your program handles all of the cases given above.

Question

Generally speaking, C-like languages treat all whitespace alike, regardless of how it was produced: one tab or four spaces makes no difference. This is not the case for many other languages (Python, Haskell, YAML, etc.), where the kind and placement of whitespace are significant. What complications would this introduce to our lexer if we needed to account for such concerns? As a hint, consider writing a lexer for a Makefile, where recipe lines must begin with a tab rather than spaces. How would the rules you have written change?