For this project, you will write the lexical analysis phase (i.e., the "scanner") of a simple compiler for a subset of the language "LOLcode". We will start with only one variable type ("NUMBR"), basic math, and the print command to output results; basically it will be little more than a simple calculator. Over the next two projects we will turn this into a working compiler, and in the five projects following that, we will expand the functionality and efficiency of the language.
You will be writing a function, called "build_lexer", that returns a lexer object. This lexer has a method called "lex", which accepts a string and returns generator/list of RPLY tokens. Your constructed lexer will remove unneeded whitespace and comments, and categorize each word or symbol as a token. A token represents an atomic unit to be parsed, and is typically realized as one or a short series of characters in a source file such as "x", "+", or "42". You will be making use of RPLY's LexerGenerator to solve this project.
We will develop our compiler using the library RPLY (which is based on Lex and Yacc). This project will only require the use of lex, which will handle lexical analysis for us, taking as input a set of regular expressions associated with each token type. Both of these tools are included in the Python3 package "rply", which is included as a folder in the starter code.
Your scanner should be able to identify each of the following tokens that it encounters:
|PRIMITIVE_TYPE||Data types: currently "NUMBR", "NUMBAR", "LETTR", and "TROOF", but more types will be introduced in future projects.|
|BANG||The exclamation point character "!".|
|NUMBR_LITERAL||Any literal integer. These consist of sequences of digits, with no other characters except an single, optional, preceding hyphen. Examples: 12, 0, or -98|
|NUMBAR_LITERAL||Any literal floating point numbers. These consist of sequences of digits, with no other characters except a single decimal point and an optional preceding hyphen. The decimal point exists must be succeeded by at least one digit. Examples: 123.456, 10.0, -0.0001, .98, and 89.4|
|LETTR_LITERAL||Any literal character. These are a single quote followed by any single character, followed by another single quote. Example: 'c'|
|TROOF_LITERAL||A boolean value. These are either WIN or FAIL.|
|YARN_LITERAL||A literal string: A double quote followed by a series of internal characters and ending in a second double quote. The internal characters can be anything except a double quote or a newline. Note: in a future project we will implement escape characters. Example: "This is my string"|
|MATH_BINARY_OPERATOR||Two operand math operators: SUM, DIFF, PRODUKT, QUOSHUNT, BIGGR, SMALLR|
|MATH_UNARY_OPERATOR||One operand math operators: FLIP, SQUAR|
|LOGICAL_BINARY_OPERATOR||Two operand logic operators: BOTH, EITHER, WON|
|LOGICAL_UNARY_OPERATOR||One operand math operators: NOT|
|LOGICAL_VARIABLE_OPERATOR||Variable operand logic operators: ALL, ANY|
|COMPARISON_BINARY_OPERATOR||Two operand comparator operators: SAEM, DIFFRINT, FURSTSMALLR, FURSTBIGGR|
|ASSIGNMENT_OPERATOR||Assignment and mutate operators: UPPIN, NERFIN|
|NEWLINE||A newline character.|
|SINGLE_COMMENT||Everything on a single line following an BTW.|
|MULTI_COMMENT||Everything between OBTW and TLDR.|
|IDENTIFIER||A sequence beginning with a letter, followed by a sequence of zero or more characters that may contain letters, numbers and underscores('_'). Currently just variable names, but in the future they will also be used for function names.|
|ERROR||An unknown character that does not match any of the tokens above.|
There also are numerous keywords (tokens whose name/type is the same as its value). These include:
A pattern is a rule by which a token is identified. It will be up to you as part of this project to identify the patterns associated with each token. A lexeme is an instance of a pattern from a source file. For example, "300" is a lexeme that would be categorized as the token NUMBR. On the other hand "my_var" is a lexeme that would get identified as a token of the type IDENTIFIER.
When multiple patterns match the text being processed, choose the one that produces the longest lexeme that starts at the current position. If two different patterns produce lexemes of the same length, choose the one that comes first in the list above. For example, the lexeme "my_NUMBR" might be incorrectly read as the IDENTIFIER "my_NUMBR" followed by the PRIMITIVE_TYPE "NUMBR", but "my_NUMBR" is longer than "my_", so it should be chosen. Likewise, the lexeme "SUM" could match either the pattern for MATH_BINARY_OPERATOR or the pattern for IDENTIFIER, but MATH_BINARY_OPERATOR should be chosen since it comes first in the list above.
Your function "build_lexer", should return a lexer object (an instance of RPLY's LexerGenerator). This object should have a "lex" method, that returns a list of RPLY Tokens when given a string.
Note: Comment tokens should not be returned by the lexer, instead they should be captured and ignored.
Here is an example LOLcode text (included in the starter code):
HAI 1.450 I HAS A result ITZ A NUMBR BTW I like apples result R 14 VISIBLE result OBTW This is a multiline comment TLDR KTHXBYE
The lexer object would generate this list of tokens if given the above string:
[ Token('HAI', 'HAI'), Token('NUMBAR_LITERAL', '1.450'), Token('NEWLINE', '\n'), Token('I', 'I'), Token('HAS', 'HAS'), Token('A', 'A'), Token('IDENTIFIER', 'result'), Token('ITZ', 'ITZ'), Token('A', 'A'), Token('PRIMITIVE_TYPE', 'NUMBR'), Token('NEWLINE', '\n'), Token('IDENTIFIER', 'result'), Token('R', 'R'), Token('NUMBR_LITERAL', '14'), Token('NEWLINE', '\n'), Token('NEWLINE', '\n'), Token('VISIBLE', 'VISIBLE'), Token('IDENTIFIER', 'result'), Token('NEWLINE', '\n'), Token('NEWLINE', '\n'), Token('KTHXBYE', 'KTHXBYE') ]
Your completed project (in a foler called Project1) MUST include: