Submission Due: the beginning of your lab session two weeks from now
In this fourth laboratory, you will construct a source code scanner for a C-like language.
Your scanner must be generated by the Lex lexical analyzer generator. For this laboratory you must recognize and print out the tokens listed below. You can check your progress by downloading the example input file and its associated output file.
The following is a list of C tokens that you must recognize. If the action is a word in all capitals, print the word and then return a token constant with the same name. If the action is a character, print that character and then return it. If the action is "ignored", do nothing. The tables below describe the syntax allowed for each token.
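The print-then-return convention described above can be sketched in plain C, independent of Lex. The helper names `emit_named` and `emit_char`, the one-token-per-line output format, and the numbering of constants from 256 are assumptions made for illustration; the handout does not prescribe them.

```c
#include <stdio.h>

/* Hypothetical token constants. Starting at 256 keeps them clear of
   single-character codes, which are returned as themselves. */
enum token { VOID = 256, INT, ID };

/* Action for an all-capitals entry: print the name, return the constant. */
static int emit_named(const char *name, int token)
{
    printf("%s\n", name);   /* assumed format: one token per line */
    return token;
}

/* Action for a single-character entry: print the character, return it. */
static int emit_char(int c)
{
    printf("%c\n", c);
    return c;
}
```

In a Lex specification, the body of each rule's action would do the equivalent of one of these two helpers.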
C Comments:
Action | Syntax | Action | Syntax |
---|---|---|---|
ignored | /*...*/ | ignored | //... |
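To make the two comment forms concrete, here is a minimal hand-written sketch that measures how many characters a comment at the start of a string occupies. The function name `comment_length` is hypothetical, and treating an unterminated block comment as running to the end of the input is an assumption, not something the handout specifies.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length of a comment starting at s, or 0 if s does not
   start with a comment. */
static size_t comment_length(const char *s)
{
    if (s[0] != '/')
        return 0;
    if (s[1] == '*') {
        /* Search after the opener so that "/" followed by "*" and "/"
           is not mistaken for a complete comment. */
        const char *end = strstr(s + 2, "*/");
        /* Assumption: an unterminated comment swallows the rest of the input. */
        return end ? (size_t)(end - s) + 2 : strlen(s);
    }
    if (s[1] == '/')
        return strcspn(s, "\n");   /* up to, not including, the newline */
    return 0;
}
```

A scanner would skip this many characters and emit nothing, which matches the "ignored" action.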
C Preprocessor Directives:
Action | Syntax |
---|---|
PREPROC | # ... |
C Punctuation:
Action | Syntax | Action | Syntax |
---|---|---|---|
( | ( | ) | ) |
{ | { | } | } |
[ | [ | ] | ] |
, | , | ; | ; |
C Identifiers:
Action | Syntax |
---|---|
ID | Starts with a letter, followed by any number of letters, digits, and underscores |
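The identifier rule can be checked by hand with a short C sketch. The function name `identifier_length` is hypothetical; it follows the table literally, so a leading underscore is rejected even though standard C would accept it.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the identifier at the start of s, or 0 if s
   does not begin with a letter. */
static size_t identifier_length(const char *s)
{
    size_t n;

    if (!isalpha((unsigned char)s[0]))
        return 0;
    n = 1;
    /* Letters, digits, and underscores may follow the first letter. */
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    return n;
}
```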
C Literal Values:
Action | Syntax | Action | Syntax |
---|---|---|---|
INTVAL | Decimal, octal, or hexadecimal | FLTVAL | Decimal value followed by dot followed by decimal value followed by 'f' |
DBLVAL | Decimal value followed by dot followed by decimal value | STRVAL | "..." |
CHARVAL | '...' | | |
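The three numeric categories differ only in small details, so it can help to see them side by side. The sketch below classifies a complete string; the function name `classify_number`, the `NOT_NUMBER` value, and the constant numbering are assumptions for illustration. It also assumes that a leading `0` means octal (so `09` is rejected) and that hexadecimal uses a `0x` or `0X` prefix, as in C; the handout does not spell these points out.

```c
#include <stddef.h>
#include <string.h>

enum literal { NOT_NUMBER = 0, INTVAL = 256, FLTVAL, DBLVAL };

static int classify_number(const char *s)
{
    size_t n;

    /* Hexadecimal: 0x or 0X followed by at least one hex digit. */
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        n = strspn(s + 2, "0123456789abcdefABCDEF");
        return (n > 0 && s[2 + n] == '\0') ? INTVAL : NOT_NUMBER;
    }

    n = strspn(s, "0123456789");
    if (n == 0)
        return NOT_NUMBER;
    if (s[n] == '\0') {
        /* Assumption: a leading zero means octal, so only 0-7 may appear. */
        if (s[0] == '0' && strspn(s, "01234567") != n)
            return NOT_NUMBER;
        return INTVAL;
    }

    /* Otherwise expect digits '.' digits, optionally followed by 'f'. */
    if (s[n] != '.')
        return NOT_NUMBER;
    s += n + 1;
    n = strspn(s, "0123456789");
    if (n == 0)
        return NOT_NUMBER;
    if (s[n] == '\0')
        return DBLVAL;
    if (s[n] == 'f' && s[n + 1] == '\0')
        return FLTVAL;
    return NOT_NUMBER;
}
```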
C Primitive Types:
Action | Syntax | Action | Syntax |
---|---|---|---|
VOID | void | CHAR | char |
SHORT | short | INT | int |
LONG | long | FLOAT | float |
DOUBLE | double | | |
C Operators:
Action | Syntax | Action | Syntax | Action | Syntax |
---|---|---|---|---|---|
EQ | == | NE | != | GE | >= |
LE | <= | GT | > | LT | < |
ADD | + | SUB | - | MUL | * |
DIV | / | MOD | % | OR | \|\| |
AND | && | BITOR | \| | BITAND | & |
BITXOR | ^ | NOT | ! | COM | ~ |
LSH | << | RSH | >> | SET | = |
SETADD | += | SETSUB | -= | SETMUL | *= |
SETDIV | /= | SETMOD | %= | SETOR | \|= |
SETAND | &= | SETXOR | ^= | SETLSH | <<= |
SETRSH | >>= | | | | |
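Many of these operators are prefixes of one another (`<`, `<<`, `<<=`), so the scanner has to prefer the longest match. Lex does this automatically: it picks the rule that matches the most input and breaks ties in favor of the rule listed first. The C sketch below illustrates the same idea by hand for a subset of the table; the `OPS` array and the function name `longest_operator` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

struct op { const char *text; const char *name; };

/* A subset of the operator table, in no particular order. */
static const struct op OPS[] = {
    {"<", "LT"}, {"<=", "LE"}, {"<<", "LSH"}, {"<<=", "SETLSH"},
    {">", "GT"}, {">=", "GE"}, {">>", "RSH"}, {">>=", "SETRSH"},
    {"=", "SET"}, {"==", "EQ"},
};

/* Returns the name of the longest operator at the start of s and stores
   its length in *len; returns NULL (and length 0) if none matches. */
static const char *longest_operator(const char *s, size_t *len)
{
    const char *best = NULL;

    *len = 0;
    for (size_t i = 0; i < sizeof OPS / sizeof OPS[0]; i++) {
        size_t n = strlen(OPS[i].text);
        /* Keep a candidate only if it is strictly longer than the best so far. */
        if (n > *len && strncmp(s, OPS[i].text, n) == 0) {
            *len = n;
            best = OPS[i].name;
        }
    }
    return best;
}
```

With this rule, the input `<<=` produces a single SETLSH rather than LT followed by LE.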
C Control Flow:
Action | Syntax | Action | Syntax |
---|---|---|---|
RETURN | return | DO | do |
WHILE | while | FOR | for |
SWITCH | switch | CASE | case |
DEFAULT | default | IF | if |
ELSE | else | CONTINUE | continue |
BREAK | break | GOTO | goto |
C Keywords:
Action | Syntax | Action | Syntax |
---|---|---|---|
UNSIGNED | unsigned | TYPEDEF | typedef |
STRUCT | struct | UNION | union |
CONST | const | STATIC | static |
EXTERN | extern | AUTO | auto |
REGISTER | register | SIZEOF | sizeof |
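Every keyword above also fits the identifier syntax, so a scanner needs some way to keep `while` from being reported as ID. In Lex, when two rules match the same length, the one listed first wins, so keyword rules are usually placed before the identifier rule. The C sketch below shows an equivalent alternative: scan a whole word, then look it up in a table. The `KEYWORDS` array (a subset only) and the function name `classify_word` are hypothetical.

```c
#include <string.h>

/* A subset of the reserved words, paired with their token names. */
static const char *const KEYWORDS[][2] = {
    {"if", "IF"}, {"else", "ELSE"}, {"while", "WHILE"},
    {"sizeof", "SIZEOF"}, {"unsigned", "UNSIGNED"},
};

/* Returns the token name for a complete word: the keyword's name if the
   word is reserved, otherwise "ID". */
static const char *classify_word(const char *word)
{
    for (size_t i = 0; i < sizeof KEYWORDS / sizeof KEYWORDS[0]; i++)
        if (strcmp(word, KEYWORDS[i][0]) == 0)
            return KEYWORDS[i][1];
    return "ID";
}
```

Because the comparison is against the whole word, `whilex` is still an identifier.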
When you demonstrate your solution for this lab, you will use both the provided test input and an input that you will not have access to ahead of time. Be sure your program handles all of the cases given above.
Generally speaking, C-like languages treat all whitespace identically, regardless of how it is created: one tab or four spaces, it makes no difference. This is not the case for many other languages (Python, Haskell, YAML, etc.), where the kind and placement of whitespace are significant. What complications does this introduce to our lexer if we need to account for such concerns? As a hint: consider writing a lexer for a Makefile, where recipe lines must begin with a tab rather than spaces. How do the rules you have written change?
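As one possible starting point for thinking about the Makefile hint, the sketch below classifies a line by its first character alone. It is a deliberately simplified model rather than a description of how make actually parses its input: the enum names and the function `classify_line` are hypothetical, and treating every space-indented line as an error is an assumption taken from the hint.

```c
enum line_kind { LINE_OTHER, LINE_RECIPE, LINE_BAD_INDENT };

/* Classifies a line by its leading character only. */
static enum line_kind classify_line(const char *line)
{
    if (line[0] == '\t')
        return LINE_RECIPE;       /* leading tab: a recipe line */
    if (line[0] == ' ')
        return LINE_BAD_INDENT;   /* assumption: leading spaces are rejected */
    return LINE_OTHER;            /* targets, assignments, blank lines, ... */
}
```

Note what changes compared with the C scanner: whitespace can no longer be discarded by a single catch-all rule, because a tab at the start of a line now carries meaning that the same tab in the middle of a line does not.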