Project 1: Lexical Analysis

For this project, you will write the lexical analysis phase (i.e., the "scanner") of a simple compiler for a subset of the language "LOLcode". We will start with only one variable type ("NUMBR"), basic math, and the print command to output results; basically it will be little more than a simple calculator. Over the next two projects we will turn this into a working compiler, and in the five projects following that, we will expand the functionality and efficiency of the language.

Description

You will be writing a function, called "build_lexer", that returns a lexer object. This lexer has a method called "lex", which accepts a string and returns generator/list of RPLY tokens. Your constructed lexer will remove unneeded whitespace and comments, and categorize each word or symbol as a token. A token represents an atomic unit to be parsed, and is typically realized as one or a short series of characters in a source file such as "x", "+", or "42". You will be making use of RPLY's LexerGenerator to solve this project.

We will develop our compiler using the library RPLY (which is based on Lex and Yacc). This project will only require the use of lex, which will handle lexical analysis for us, taking as input a set of regular expressions associated with each token type. Both of these tools are included in the Python3 package "rply", which is included as a folder in the starter code.

Your scanner should be able to identify each of the following tokens that it encounters:

Token Desciption
PRIMITIVE_TYPEData types: currently "NUMBR", "NUMBAR", "LETTR", and "TROOF", but more types will be introduced in future projects.
BANGThe exclamation point character "!".
NUMBR_LITERALAny literal integer. These consist of sequences of digits, with no other characters except an single, optional, preceding hyphen. Examples: 12, 0, or -98
NUMBAR_LITERALAny literal floating point numbers. These consist of sequences of digits, with no other characters except a single decimal point and an optional preceding hyphen. The decimal point exists must be succeeded by at least one digit. Examples: 123.456, 10.0, -0.0001, .98, and 89.4
LETTR_LITERALAny literal character. These are a single quote followed by any single character, followed by another single quote. Example: 'c'
TROOF_LITERALA boolean value. These are either WIN or FAIL.
YARN_LITERALA literal string: A double quote followed by a series of internal characters and ending in a second double quote. The internal characters can be anything except a double quote or a newline. Note: in a future project we will implement escape characters. Example: "This is my string"
MATH_BINARY_OPERATORTwo operand math operators: SUM, DIFF, PRODUKT, QUOSHUNT, BIGGR, SMALLR
MATH_UNARY_OPERATOROne operand math operators: FLIP, SQUAR
LOGICAL_BINARY_OPERATORTwo operand logic operators: BOTH, EITHER, WON
LOGICAL_UNARY_OPERATOROne operand math operators: NOT
LOGICAL_VARIABLE_OPERATORVariable operand logic operators: ALL, ANY
COMPARISON_BINARY_OPERATORTwo operand comparator operators: SAEM, DIFFRINT, FURSTSMALLR, FURSTBIGGR
ASSIGNMENT_OPERATORAssignment and mutate operators: UPPIN, NERFIN
NEWLINE A newline character.
SINGLE_COMMENTEverything on a single line following an BTW.
MULTI_COMMENTEverything between OBTW and TLDR.
IDENTIFIERA sequence beginning with a letter, followed by a sequence of zero or more characters that may contain letters, numbers and underscores('_'). Currently just variable names, but in the future they will also be used for function names.
ERRORAn unknown character that does not match any of the tokens above.

There also are numerous keywords (tokens whose name/type is the same as its value). These include:

  • HAI
  • KTHXBYE
  • VISIBLE
  • GIMMEH
  • WHATEVR
  • I
  • HAS
  • A
  • ITZ
  • AN
  • R
  • OF
  • MKAY
  • BY

A pattern is a rule by which a token is identified. It will be up to you as part of this project to identify the patterns associated with each token. A lexeme is an instance of a pattern from a source file. For example, "300" is a lexeme that would be categorized as the token NUMBR. On the other hand "my_var" is a lexeme that would get identified as a token of the type IDENTIFIER.

When multiple patterns match the text being processed, choose the one that produces the longest lexeme that starts at the current position. If two different patterns produce lexemes of the same length, choose the one that comes first in the list above. For example, the lexeme "my_NUMBR" might be incorrectly read as the IDENTIFIER "my_NUMBR" followed by the PRIMITIVE_TYPE "NUMBR", but "my_NUMBR" is longer than "my_", so it should be chosen. Likewise, the lexeme "SUM" could match either the pattern for MATH_BINARY_OPERATOR or the pattern for IDENTIFIER, but MATH_BINARY_OPERATOR should be chosen since it comes first in the list above.

Output

Your function "build_lexer", should return a lexer object (an instance of RPLY's LexerGenerator). This object should have a "lex" method, that returns a list of RPLY Tokens when given a string.

Note: Comment tokens should not be returned by the lexer, instead they should be captured and ignored.

Example

Here is an example LOLcode text (included in the starter code):

HAI 1.450
I HAS A result ITZ A NUMBR BTW I like apples
result R 14

VISIBLE result
OBTW This is a 
multiline comment
TLDR
KTHXBYE

The lexer object would generate this list of tokens if given the above string:

[
Token('HAI', 'HAI'),
Token('NUMBAR_LITERAL', '1.450'),
Token('NEWLINE', '\n'),
Token('I', 'I'),
Token('HAS', 'HAS'),
Token('A', 'A'),
Token('IDENTIFIER', 'result'),
Token('ITZ', 'ITZ'),
Token('A', 'A'),
Token('PRIMITIVE_TYPE', 'NUMBR'),
Token('NEWLINE', '\n'),
Token('IDENTIFIER', 'result'),
Token('R', 'R'),
Token('NUMBR_LITERAL', '14'),
Token('NEWLINE', '\n'),
Token('NEWLINE', '\n'),
Token('VISIBLE', 'VISIBLE'),
Token('IDENTIFIER', 'result'),
Token('NEWLINE', '\n'),
Token('NEWLINE', '\n'),
Token('KTHXBYE', 'KTHXBYE')
]

Turning It In

Your completed project (in a foler called Project1) MUST include:

  • A project.py python file with a function named "build_lexer".
  • A README.txt file that includes feedback on the project, such as how long it took to complete. You should also include any external resources (attributions) that you used in its development.