Lab 4: Source Code Scanner for (a subset of) the C Language

Submission Due: the beginning of your lab session two weeks from now

In this fourth laboratory, you will construct a source code scanner for a C-like language.

Lab Materials

Lab Source Files

Lab Assignment

Your scanner must be generated by the Lex lexical analyzer generator. For this laboratory you must recognize and print out the tokens listed below. You can check your progress by downloading the example input file and its associated output file.

The following is a list of C tokens that you must recognize. If the action is a word in all capitals, print the word and then return a token constant with the same name. If the action is a single character, print that character and then return it. If the action is "ignored", do nothing. The tables below describe the syntax to accept for each token.
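The action convention described above can be sketched in C, the language in which Lex actions are written. This is only an illustration: the token values, the helper names emit_named and emit_char, and the one-token-per-line output are all assumptions. The provided example output file defines the real format.

```c
#include <stdio.h>

/* Hypothetical token constants. Named tokens start at 256 so they
   cannot collide with single-character tokens, which are returned
   as their own character codes. */
enum token { ID = 256, INTVAL, PREPROC };

/* Action for an all-capitals entry: print the name, return the constant. */
int emit_named(const char *name, int token)
{
    printf("%s\n", name);
    return token;
}

/* Action for a single-character entry: print the character, return it. */
int emit_char(char c)
{
    putchar(c);
    putchar('\n');
    return (unsigned char)c;
}
```

In a Lex specification, each rule's action would do the equivalent of one of these two helpers before returning.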

C Comments:

Action	Syntax	Action	Syntax
ignored	/* ... */	ignored	// ...
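As a rough illustration of the two comment forms, the C function below measures the comment at the start of a string. The name comment_length is hypothetical, and treating an unterminated block comment as running to the end of the input is an assumption, since the lab text does not say how to handle that case.

```c
#include <stddef.h>
#include <string.h>

// Returns the length of the comment starting at s, or 0 if s does not
// start with a comment.
size_t comment_length(const char *s)
{
    if (s[0] == '/' && s[1] == '/') {
        // Line comment: everything up to, but not including, the newline.
        return strcspn(s, "\n");
    }
    if (s[0] == '/' && s[1] == '*') {
        // Block comment: search for the terminator after the opening pair,
        // so that the three characters "/*/" do not count as complete.
        const char *end = strstr(s + 2, "*/");
        return end ? (size_t)(end - s) + 2 : strlen(s);
    }
    return 0;
}
```

Note that a lone slash is not a comment, so the scanner still has to fall through to the DIV and SETDIV operators.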

C Preprocessor Directives:

Action	Syntax
PREPROC	# ...

C Punctuation:

Action	Syntax	Action	Syntax
(	(	)	)
{	{	}	}
[	[	]	]
,	,	;	;

C Identifiers:

Action	Syntax
ID	A letter followed by any number of letters, digits, and underscores
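The identifier rule above corresponds to a pattern along the lines of [A-Za-z][A-Za-z0-9_]* in a Lex specification. A minimal C sketch of the same check follows. The function name is_identifier is hypothetical, and the sketch follows this lab's rule, under which a leading underscore is not accepted even though standard C allows one.

```c
#include <ctype.h>
#include <stdbool.h>

/* True if s is a letter followed by zero or more letters, digits,
   or underscores (the lab's rule, checked in the default "C" locale). */
bool is_identifier(const char *s)
{
    if (!isalpha((unsigned char)*s)) {
        return false;               /* empty, or does not start with a letter */
    }
    for (s++; *s != '\0'; s++) {
        if (!isalnum((unsigned char)*s) && *s != '_') {
            return false;
        }
    }
    return true;
}
```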

C Literal Values:

Action	Syntax	Action	Syntax
INTVAL	Decimal, octal, or hexadecimal	FLTVAL	Decimal value followed by dot followed by decimal value followed by 'f'
DBLVAL	Decimal value followed by dot followed by decimal value	STRVAL	"..."
CHARVAL	'...'
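One way to read the numeric rows of the table above is sketched below in C. The names classify_number and literal_kind are hypothetical. The sketch assumes that hexadecimal values use a 0x or 0X prefix, that a leading 0 marks an octal value, and that only a lowercase 'f' suffix marks a float, as the table states. It classifies a whole string rather than scanning a stream, so it is a check on the rules and not a replacement for the Lex patterns.

```c
#include <ctype.h>
#include <stddef.h>

enum literal_kind { LIT_NONE, LIT_INTVAL, LIT_DBLVAL, LIT_FLTVAL };

/* Number of decimal digits at the start of s. */
static size_t digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

enum literal_kind classify_number(const char *s)
{
    size_t n;

    /* Hexadecimal: 0x or 0X followed by at least one hex digit. */
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        n = 2;
        while (isxdigit((unsigned char)s[n])) n++;
        return (n > 2 && s[n] == '\0') ? LIT_INTVAL : LIT_NONE;
    }

    n = digits(s);
    if (n == 0) return LIT_NONE;

    if (s[n] == '\0') {
        /* A leading 0 means octal, so only digits 0-7 may follow. */
        if (s[0] == '0') {
            for (size_t i = 1; i < n; i++) {
                if (s[i] > '7') return LIT_NONE;
            }
        }
        return LIT_INTVAL;
    }

    /* Otherwise expect: digits '.' digits, optionally followed by 'f'. */
    if (s[n] != '.') return LIT_NONE;
    s += n + 1;
    n = digits(s);
    if (n == 0) return LIT_NONE;
    if (s[n] == '\0') return LIT_DBLVAL;
    if (s[n] == 'f' && s[n + 1] == '\0') return LIT_FLTVAL;
    return LIT_NONE;
}
```

Under these assumptions, "3.14f" is a FLTVAL while "3.14" is a DBLVAL, so the float pattern has to be able to win over the double pattern when the suffix is present.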

C Primitive Types:

Action	Syntax	Action	Syntax
VOID	void	CHAR	char
SHORT	short	INT	int
LONG	long	FLOAT	float
DOUBLE	double

C Operators:

Action	Syntax	Action	Syntax	Action	Syntax
EQ	==	NE	!=	GE	>=
LE	<=	GT	>	LT	<
ADD	+	SUB	-	MUL	*
DIV	/	MOD	%	OR	||
AND	&&	BITOR	|	BITAND	&
BITXOR	^	NOT	!	COM	~
LSH	<<	RSH	>>	SET	=
SETADD	+=	SETSUB	-=	SETMUL	*=
SETDIV	/=	SETMOD	%=	SETOR	|=
SETAND	&=	SETXOR	^=	SETLSH	<<=
SETRSH	>>=
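Many of the operators above are prefixes of one another: <, << and <<= all begin the same way. Lex resolves this by always choosing the longest match. The C sketch below imitates that behavior by hand so the effect can be seen directly. The struct op type and the longest_operator function are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <string.h>

struct op {
    const char *text;  /* operator spelling */
    const char *name;  /* action name from the table above */
};

static const struct op OPS[] = {
    {"==", "EQ"}, {"!=", "NE"}, {">=", "GE"}, {"<=", "LE"}, {">", "GT"},
    {"<", "LT"}, {"+", "ADD"}, {"-", "SUB"}, {"*", "MUL"}, {"/", "DIV"},
    {"%", "MOD"}, {"||", "OR"}, {"&&", "AND"}, {"|", "BITOR"},
    {"&", "BITAND"}, {"^", "BITXOR"}, {"!", "NOT"}, {"~", "COM"},
    {"<<", "LSH"}, {">>", "RSH"}, {"=", "SET"}, {"+=", "SETADD"},
    {"-=", "SETSUB"}, {"*=", "SETMUL"}, {"/=", "SETDIV"}, {"%=", "SETMOD"},
    {"|=", "SETOR"}, {"&=", "SETAND"}, {"^=", "SETXOR"}, {"<<=", "SETLSH"},
    {">>=", "SETRSH"},
};

/* Returns the longest operator that s starts with, or NULL if none. */
const struct op *longest_operator(const char *s)
{
    const struct op *best = NULL;
    size_t best_len = 0;
    for (size_t i = 0; i < sizeof OPS / sizeof OPS[0]; i++) {
        size_t n = strlen(OPS[i].text);
        if (n > best_len && strncmp(s, OPS[i].text, n) == 0) {
            best = &OPS[i];
            best_len = n;
        }
    }
    return best;
}
```

For example, longest_operator("<<=1") selects SETLSH rather than LSH or LT, which is the behavior a Lex-generated scanner gives without extra work.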

C Control Flow:

Action	Syntax	Action	Syntax
RETURN	return	DO	do
WHILE	while	FOR	for
SWITCH	switch	CASE	case
DEFAULT	default	IF	if
ELSE	else	CONTINUE	continue
BREAK	break	GOTO	goto

C Keywords:

Action	Syntax	Action	Syntax
UNSIGNED	unsigned	TYPEDEF	typedef
STRUCT	struct	UNION	union
CONST	const	STATIC	static
EXTERN	extern	AUTO	auto
REGISTER	register	SIZEOF	sizeof
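Every reserved word in the primitive type, control flow, and keyword tables also fits the identifier syntax. When two Lex rules match the same length of input, the rule listed first wins, so the reserved-word rules need to come before the ID rule. The C sketch below shows the same decision as a table lookup. The name word_action is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Reserved words from the tables above, paired with their action names. */
static const char *const RESERVED[][2] = {
    {"void", "VOID"}, {"char", "CHAR"}, {"short", "SHORT"}, {"int", "INT"},
    {"long", "LONG"}, {"float", "FLOAT"}, {"double", "DOUBLE"},
    {"return", "RETURN"}, {"do", "DO"}, {"while", "WHILE"}, {"for", "FOR"},
    {"switch", "SWITCH"}, {"case", "CASE"}, {"default", "DEFAULT"},
    {"if", "IF"}, {"else", "ELSE"}, {"continue", "CONTINUE"},
    {"break", "BREAK"}, {"goto", "GOTO"}, {"unsigned", "UNSIGNED"},
    {"typedef", "TYPEDEF"}, {"struct", "STRUCT"}, {"union", "UNION"},
    {"const", "CONST"}, {"static", "STATIC"}, {"extern", "EXTERN"},
    {"auto", "AUTO"}, {"register", "REGISTER"}, {"sizeof", "SIZEOF"},
};

/* Action name for a lexeme that fits the identifier syntax:
   the reserved word's name on an exact match, otherwise "ID". */
const char *word_action(const char *lexeme)
{
    for (size_t i = 0; i < sizeof RESERVED / sizeof RESERVED[0]; i++) {
        if (strcmp(lexeme, RESERVED[i][0]) == 0) {
            return RESERVED[i][1];
        }
    }
    return "ID";
}
```

Because the comparison is on the whole lexeme, a word such as "integer" or "whilex" remains an ID even though it begins with a reserved word.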

When you demonstrate your solution for this lab, you will do so using both the provided test input as well as an input that you will not have access to ahead of time. Be sure your program handles all of the cases given above.

Question

Generally speaking, C-like languages treat all whitespace alike, regardless of how it was produced: one tab or four spaces, it makes no difference. This is not the case for many other languages (Python, Haskell, YAML, etc.), where the kind and placement of whitespace are significant. What complications would this introduce to our lexer if we needed to account for such concerns? As a hint, consider writing a lexer for a Makefile, where recipe lines must begin with a tab and spaces are not accepted in its place. How would the rules you have written change?