Computers |
|
| | #1 (permalink) |
| Newb Techie Join Date: Feb 2005
Posts: 42
| Hello everyone! I'm kinda lost and dont know where to start. We are asked to make a compiler and the first part would be the lexical analyzer. I search the net and found out that it would be a program that would give tokens to a series of words, characters or any combination of both provided what you intend it to do. Ok, what if for example I have 6 token categories: 1. Reserved words : pre defined 2. Operators : +/- etc 3. Variables : 4. Strings : all statements enclosed with ' or " 5. invalid statement :invalid 6. numbers : ex. 3, 3.1 How do i check a certain user input that would recognize if the series of inputs are: reserved words, operators, etc. I dont really know where to start... This is to be done in C++ btw. If you happen to know how or know any good links that could help would be greatly appreciated. Thanks! ![]()
__________________ |
| | |
| | #2 (permalink) |
| Ultra Techie | Well, when I did this back in my OS/Compiler design classes, we had to use lex and yacc to do it. (Tools that generate lexical analyzer and grammar parser) But basically, are you asking how to write your own lexical analyzer? Well, I am not sure of the STL libraries in C++ provide a tokenizer, but I am pretty sure that they do. You would need to parse all the tokens in your input and generate a big list of token objects. So, basically, you break down your input into all the corresponding tokens. [Keep in mind all of the following lexical analyzer commentary I am writing here is for writing a homemade easy rigged-type version. Normally this would be done using regular expressions but that is a whole OTHER can of worms] You let the STL library function (assuming it exists, otherwise, just go grab a library off google to help with this) find the actual individual tokens, then you compare each token and determine the type of token it is (i.e. reserved word, operator, whatever) and assign it to a token object with a corresponding token type attribute. You are going to need to define some rules for the tokens, in other words, what are your reserved words, and, can your variables start with a number or not. You're also going to need to tell the built-in tokenizer to stop parsing text when it encounters things like whitespace, your operators, etc. Then, once you have all of these things worked out, you've got your input parsed into a bunch of token objects that have a corresponding type which fits into a grammar somehow. So, to represent, you might have a list like <reserved word> <variable> <operator> <variable>, etc. Then, you take all of this and run it through a parser which can go through all the tokens you've classified and determine if the input matches your grammar, which you would have to define somehow. Mostly this is done in formal language definition form, and the grammar parser is made by yacc or bison or something. But since you're going to do it manually, you'll want to set up a bunch of types of things.... So in your rules specification, you would want to have maybe... (Forgive my syntax, its been a LONG time, but you get the general idea) <expression> = <expression> <operator> <variable> | <expression> <operator> <expression> | <variable> <operator> <variable>. And so on and so forth. You break your language down into that kind of grammar and such, and you write a parser to examine the tokens and fit them into the grammar. If they don't fit, you have a syntax error. I hope that I kinda made it somewhat clearer, but if I haven't I apologize. Writing lexical analyzers and grammar parsers is a huge pain in the rear and basically few people do this kind of thing from scratch anymore. There isn't any sense in reinventing the wheel. But, of course, writing a compiler is already a pretty complicated thing to do, so this is just another aspect of that.
__________________ Desktop machine: 2 x Opteron 246, Asus K8N-DL, 2GB PC3200 ECC Reg., XFX GeForce 6600GT, 74gb WD Raptor, 2 x 19\" LCDs, Windows XP x64 Server machine: Intel P4 3.0GHz 2MB EM64T, ECS i865pe, 1GB PC3200, 36gb WD Raptor, Windows Server 2003 Laptop: Dell Inspiron 9100 (Intel P4 3.2GHz 1MB Prescott, i865pe, 512MB PC3200, Mobility Radeon 9700, DVD+R/DL Burner), Windows XP Linux: P3 450Mhz, 386MB ram, Slackware 10.1 (Running mySQL/Apache) |
| | |
| | #3 (permalink) |
| Monster Techie Join Date: May 2004 Location: /usr/root/mn/us
Posts: 1,121
| Personally what I would do, keeping in mind this is the simple and very inefficient way of doing it, would be to setup a switch statement. For each token, setup a corresponding segment in your switch statement that will handle what you want done. for example Code: switch Token
on OPERATOR {
instert parse code here and call corresponding function
}
__________________ ![]() Its a frigging Laptop, not a Labtop!!!! |
| | |
| | #4 (permalink) |
| Ultra Techie | That would probably be a good way to structure the core of the lexical analyzer. Just to clean up your idea since he is using C++, he can use it like: Code: enum myTokens {RESERVED_WORD, OPERATOR, VARIABLE, STRING, INVALID, NUMBER};
enum myTokens token_type;
char* token;
while(token = getNextToken()) // Your token function, whatever it is
{
token_type = getTokenType(token); // Your token examining function
switch (token_type)
{
case RESERVED_WORD: // some code here
break;
case OPERATOR: // some code here
break;
/// etc, etc
default: // in case you got some kind of invalid data
}
}
__________________ Desktop machine: 2 x Opteron 246, Asus K8N-DL, 2GB PC3200 ECC Reg., XFX GeForce 6600GT, 74gb WD Raptor, 2 x 19\" LCDs, Windows XP x64 Server machine: Intel P4 3.0GHz 2MB EM64T, ECS i865pe, 1GB PC3200, 36gb WD Raptor, Windows Server 2003 Laptop: Dell Inspiron 9100 (Intel P4 3.2GHz 1MB Prescott, i865pe, 512MB PC3200, Mobility Radeon 9700, DVD+R/DL Burner), Windows XP Linux: P3 450Mhz, 386MB ram, Slackware 10.1 (Running mySQL/Apache) |
| | |