Lab 3: HTML-to-LaTeX Converter Using Lex

Submission Due: the beginning of your lab session two weeks from when it was assigned

In this laboratory, students will gain experience with the Lex lexical analyzer generator by constructing a program that will convert a subset of HTML into LaTeX.

Lab Materials

Lab Assignment

Begin the project by downloading the laboratory 3 source code, the LaTeX output reference, and the PDF output reference. The laboratory 3 source code contains a Make-based build system, initial Lex source code for the HTML-to-LaTeX converter, and an example input file. The LaTeX output reference shows what the resulting LaTeX output should look like for the example input, and the PDF output reference shows what the resulting PDF should look like. If you have never used LaTeX to typeset a document, then the open-content LaTeX book may be helpful (see slides).

To start, build the lab without making any changes: extract the source code, enter the laboratory 3 directory, and type make on your terminal. This should create an executable named html2latex inside the directory. Once the build succeeds, you can attempt to convert the example input file into a PDF output file by typing make test on your terminal. This will fail with an error because you have not yet completed the laboratory.

The purpose of this laboratory is to convert the subset of HTML given below into correct LaTeX output that can then be converted into a PDF. The laboratory 3 source code that you downloaded has an example of how to use Lex to convert the HTML H1 tag and HTML comments into LaTeX. You will complete the laboratory by adding the required additional rules to the Lex source code file html2latex.l. To receive full points in this laboratory you will need to produce correct output by converting the following HTML tags into LaTeX:

HTML Comments

HTML comments begin with the tag <!-- and end with the tag -->. Any character between these two tags is considered a comment. This includes other HTML tags (i.e. HTML tags appearing inside of an HTML comment are considered part of the comment). The laboratory 3 source code contains Lex source code to handle HTML comments appropriately.
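
As an illustration of why tags inside a comment stay part of the comment, here is a minimal, self-contained sketch that uses an exclusive start condition. It is not the code shipped with the lab: the start condition name COMMENT, the choice to emit the comment as a LaTeX % comment, and the use of flex-specific %option lines are all assumptions.

```lex
%option noyywrap nounput noinput

%{
#include <stdio.h>
%}

/* COMMENT is an assumed name for an exclusive start condition:
   while it is active, only rules tagged <COMMENT> can match. */
%x COMMENT

%%

"<!--"           { BEGIN(COMMENT); fputs("% ", stdout); }
<COMMENT>"-->"   { BEGIN(INITIAL); fputs("\n", stdout); }
<COMMENT>\n      { fputs("\n% ", stdout); /* keep each comment line commented */ }
<COMMENT>.       { ECHO; /* tags such as <b> are copied as plain comment text */ }

%%

int main(void)
{
    return yylex();
}
```

Because the start condition is exclusive, a rule for a tag such as <b> that is written without the <COMMENT> prefix cannot fire until the closing --> switches the scanner back to INITIAL.
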
HTML Headings

You must convert HTML headers for H1, H2, and H3 into LaTeX source code. The HTML headers begin with the tags <h1>, <h2>, and <h3> respectively and end with the tags </h1>, </h2>, and </h3>. You will convert an HTML H1 header into a LaTeX section element, an HTML H2 header into a LaTeX subsection element, and an HTML H3 header into a LaTeX subsubsection element. You can assume that the content inside of an HTML heading is plain text (i.e. HTML tags cannot appear between the begin tag and end tag). The laboratory 3 source code contains Lex source code to handle HTML H1 headers correctly.
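
Since heading content is assumed to be plain text, one possible approach is to match a whole heading with a single pattern and strip the tags from yytext. The sketch below is an assumption about how this could be done, not the structure of the provided html2latex.l; the helper emit_heading is hypothetical, and the %option lines assume flex.

```lex
%option noyywrap nounput noinput

%{
#include <stdio.h>

/* Hypothetical helper: prints \command{content}, where content is the
   matched text without its opening tag (open_len characters) and its
   closing tag (close_len characters). */
static void emit_heading(const char *command, const char *text, size_t len,
                         size_t open_len, size_t close_len)
{
    printf("\\%s{%.*s}\n", command,
           (int)(len - open_len - close_len), text + open_len);
}
%}

%%

"<h2>"[^<]*"</h2>"   { emit_heading("subsection", yytext, yyleng, 4, 5); }
"<h3>"[^<]*"</h3>"   { emit_heading("subsubsection", yytext, yyleng, 4, 5); }

%%

int main(void)
{
    return yylex();
}
```

With this sketch, the input <h2>Background</h2> produces \subsection{Background}. The pattern [^<]* relies on the stated guarantee that no tag appears between the begin tag and the end tag.
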
HTML Pre-Formatted Paragraphs

HTML pre-formatted paragraphs begin with the tag <pre> and end with the tag </pre>. Any character between these two tags, including white space, is considered to be part of the pre-formatted text and should be output exactly as seen (i.e. all text between the beginning and ending tags should be preserved exactly in the output). HTML pre-formatted paragraphs should be output using LaTeX verbatim environments.
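
Preserving text exactly is a natural fit for an exclusive start condition, because every other rule is switched off until the closing tag is seen. The following is a minimal sketch under that assumption; the start condition name PRE is made up for illustration, and the %option lines assume flex.

```lex
%option noyywrap nounput noinput

%{
#include <stdio.h>
%}

/* PRE is an assumed name for an exclusive start condition. */
%x PRE

%%

"<pre>"         { BEGIN(PRE); fputs("\\begin{verbatim}", stdout); }
<PRE>"</pre>"   { BEGIN(INITIAL); fputs("\\end{verbatim}\n", stdout); }
<PRE>.|\n       { ECHO; /* copy every character, white space included */ }

%%

int main(void)
{
    return yylex();
}
```

Inside PRE, the closing tag wins over the single-character rule because Lex always prefers the longest match.
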
HTML Paragraphs

HTML paragraphs begin with the tag <p> and end with the tag </p>. HTML paragraphs contain text that can be intermingled with certain HTML tags. The list of tags that you are required to support appears below. You should convert HTML paragraphs into LaTeX paragraphs, which are just blocks of text separated by a blank line.
HTML Ordered and Unordered Lists

HTML ordered lists begin with the tag <ol> and end with the tag </ol>. Likewise, unordered lists begin with <ul> and end with </ul>. Both types of lists can contain only list items that begin with the tag <li> and end with the tag </li>. HTML list items contain the same type of text as HTML paragraphs. You should convert HTML ordered lists into LaTeX enumerate environments and HTML unordered lists into LaTeX itemize environments.
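
The list mapping can be sketched as one rule per tag, as below. This is an illustrative fragment wrapped in a complete flex file, not the lab's reference solution; in particular, it assumes that emitting \item at <li> and a newline at </li> is acceptable output.

```lex
%option noyywrap nounput noinput

%{
#include <stdio.h>
%}

%%

"<ol>"    { fputs("\\begin{enumerate}\n", stdout); }
"</ol>"   { fputs("\\end{enumerate}\n", stdout); }
"<ul>"    { fputs("\\begin{itemize}\n", stdout); }
"</ul>"   { fputs("\\end{itemize}\n", stdout); }
"<li>"    { fputs("\\item ", stdout); }
"</li>"   { fputs("\n", stdout); }

%%

int main(void)
{
    return yylex();
}
```

Text that matches no rule, such as the words inside a list item, is copied to the output unchanged by the scanner's default action.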

HTML paragraphs and HTML list items can both contain HTML tags inside of their text region. You are not required to support nesting of these HTML tags, so you can assume the text between each pair of tags is plain text. You must support the following tags inside of paragraphs and list items:

Small Text

HTML small text begins with the tag <small> and ends with the tag </small>. HTML small text should be converted to a LaTeX scriptsize element.
Large Text

HTML large text begins with the tag <big> and ends with the tag </big>. HTML large text should be converted to a LaTeX Large element.
Bold Text

HTML bold text begins with the tag <b> and ends with the tag </b>. HTML bold text should be converted to a LaTeX textbf element.
Italic Text

HTML italic text begins with the tag <i> and ends with the tag </i>. HTML italic text should be converted to a LaTeX textit element.
Strong Text

HTML strong text begins with the tag <strong> and ends with the tag </strong>. HTML strong text should be converted to a LaTeX textmd element.
Emphasized Text

HTML emphasized text begins with the tag <em> and ends with the tag </em>. HTML emphasized text should be converted to a LaTeX emph element.
Superscript Text

HTML superscript text begins with the tag <sup> and ends with the tag </sup>. HTML superscript text should be converted to a LaTeX textsuperscript element.
Subscript Text

HTML subscript text begins with the tag <sub> and ends with the tag </sub>. HTML subscript text should be converted to a LaTeX textsubscript element.
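
Because nesting does not have to be supported, each inline tag can be translated independently of the others. The sketch below shows two of the tags plus the paragraph tags. It assumes that commands such as textbf take the text as a braced argument, while size switches such as scriptsize are written inside a brace group; it also assumes flex for the %option lines.

```lex
%option noyywrap nounput noinput

%{
#include <stdio.h>
%}

%%

"<p>"        { /* a LaTeX paragraph needs no opening markup */ }
"</p>"       { fputs("\n\n", stdout); /* blank line ends the paragraph */ }
"<b>"        { fputs("\\textbf{", stdout); }
"</b>"       { fputs("}", stdout); }
"<small>"    { fputs("{\\scriptsize ", stdout); }
"</small>"   { fputs("}", stdout); }

%%

int main(void)
{
    return yylex();
}
```

For the input <p>Some <b>bold</b> text</p>, this sketch prints Some \textbf{bold} text followed by a blank line. The rules fire wherever the tags appear; restricting them to paragraphs and list items would need a start condition.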

When you demonstrate your solution for this lab, you will do so using both the provided test input as well as an input that you will not have access to ahead of time. Be sure your program handles all of the cases given above.

Lab Questions

In a paragraph or two, describe another input-to-output translation that Lex could be used for. What domain-specific concerns need to be addressed? Can Lex handle them, and if so, how? In the interest of keeping things interesting, please avoid translations similar to the one we dealt with in this lab.