Computer Forums

Member Login

Remember Me? Sign Up! | Forgot Password
 
Slogan
 
Closed Thread
Old 07-19-2005, 07:32 PM   #1 (permalink)
 
Newb Techie

Join Date: Feb 2005

Posts: 42

strifer

Default Compiler Help

Hello everyone! I'm kinda lost and dont know where to start. We are asked to make a compiler and the first part would be the lexical analyzer. I search the net and found out that it would be a program that would give tokens to a series of words, characters or any combination of both provided what you intend it to do.
Ok, what if for example I have 6 token categories:

1. Reserved words : pre defined
2. Operators : +/- etc
3. Variables :
4. Strings : all statements enclosed with ' or "
5. invalid statement :invalid
6. numbers : ex. 3, 3.1

How do i check a certain user input that would recognize if the series of inputs are: reserved words, operators, etc. I dont really know where to start... This is to be done in C++ btw. If you happen to know how or know any good links that could help would be greatly appreciated. Thanks!
strifer is offline  
Old 07-19-2005, 08:44 PM   #2 (permalink)
 
Ultra Techie

Join Date: Jul 2005

Posts: 530

TheHeadFL

Send a message via AIM to TheHeadFL
Default

Well, when I did this back in my OS/Compiler design classes, we had to use lex and yacc to do it. (Tools that generate lexical analyzer and grammar parser)

But basically, are you asking how to write your own lexical analyzer? Well, I am not sure of the STL libraries in C++ provide a tokenizer, but I am pretty sure that they do. You would need to parse all the tokens in your input and generate a big list of token objects.

So, basically, you break down your input into all the corresponding tokens.

[Keep in mind all of the following lexical analyzer commentary I am writing here is for writing a homemade easy rigged-type version. Normally this would be done using regular expressions but that is a whole OTHER can of worms]

You let the STL library function (assuming it exists, otherwise, just go grab a library off google to help with this) find the actual individual tokens, then you compare each token and determine the type of token it is (i.e. reserved word, operator, whatever) and assign it to a token object with a corresponding token type attribute. You are going to need to define some rules for the tokens, in other words, what are your reserved words, and, can your variables start with a number or not. You're also going to need to tell the built-in tokenizer to stop parsing text when it encounters things like whitespace, your operators, etc.

Then, once you have all of these things worked out, you've got your input parsed into a bunch of token objects that have a corresponding type which fits into a grammar somehow. So, to represent, you might have a list like <reserved word> <variable> <operator> <variable>, etc.

Then, you take all of this and run it through a parser which can go through all the tokens you've classified and determine if the input matches your grammar, which you would have to define somehow. Mostly this is done in formal language definition form, and the grammar parser is made by yacc or bison or something. But since you're going to do it manually, you'll want to set up a bunch of types of things....

So in your rules specification, you would want to have maybe... (Forgive my syntax, its been a LONG time, but you get the general idea)
<expression> = <expression> <operator> <variable> | <expression> <operator> <expression> | <variable> <operator> <variable>.

And so on and so forth. You break your language down into that kind of grammar and such, and you write a parser to examine the tokens and fit them into the grammar. If they don't fit, you have a syntax error.

I hope that I kinda made it somewhat clearer, but if I haven't I apologize. Writing lexical analyzers and grammar parsers is a huge pain in the rear and basically few people do this kind of thing from scratch anymore. There isn't any sense in reinventing the wheel. But, of course, writing a compiler is already a pretty complicated thing to do, so this is just another aspect of that.
__________________
Desktop machine: 2 x Opteron 246, Asus K8N-DL, 2GB PC3200 ECC Reg., XFX GeForce 6600GT, 74gb WD Raptor, 2 x 19\" LCDs, Windows XP x64
Server machine: Intel P4 3.0GHz 2MB EM64T, ECS i865pe, 1GB PC3200, 36gb WD Raptor, Windows Server 2003
Laptop: Dell Inspiron 9100 (Intel P4 3.2GHz 1MB Prescott, i865pe, 512MB PC3200, Mobility Radeon 9700, DVD+R/DL Burner), Windows XP
Linux: P3 450Mhz, 386MB ram, Slackware 10.1 (Running mySQL/Apache)
TheHeadFL is offline  
Old 07-20-2005, 02:06 AM   #3 (permalink)
 
Monster Techie

Join Date: May 2004

Location: /usr/root/mn/us

Posts: 1,121

bla!! is on a distinguished road

Default

Personally what I would do, keeping in mind this is the simple and very inefficient way of doing it, would be to setup a switch statement.

For each token, setup a corresponding segment in your switch statement that will handle what you want done.

for example

Code:
switch Token

on OPERATOR {
    instert parse code here and call corresponding function
}
I apologize for any coding errors in that, I'm quite rusty when it comes to coding.
__________________
<br>
Its a frigging Laptop, not a Labtop!!!!
bla!! is offline  
Old 07-20-2005, 06:00 AM   #4 (permalink)
 
Ultra Techie

Join Date: Jul 2005

Posts: 530

TheHeadFL

Send a message via AIM to TheHeadFL
Default

That would probably be a good way to structure the core of the lexical analyzer.

Just to clean up your idea since he is using C++, he can use it like:

Code:
enum myTokens {RESERVED_WORD, OPERATOR, VARIABLE, STRING, INVALID, NUMBER};

enum myTokens token_type;
char* token;

while(token = getNextToken())  // Your token function, whatever it is
{

  token_type = getTokenType(token);  // Your token examining function

  switch (token_type)
  {

      case RESERVED_WORD:  // some code here
           break;
      case OPERATOR:  // some code here
           break;
      /// etc, etc
      default:  // in case you got some kind of invalid data
  }

}

__________________
Desktop machine: 2 x Opteron 246, Asus K8N-DL, 2GB PC3200 ECC Reg., XFX GeForce 6600GT, 74gb WD Raptor, 2 x 19\" LCDs, Windows XP x64
Server machine: Intel P4 3.0GHz 2MB EM64T, ECS i865pe, 1GB PC3200, 36gb WD Raptor, Windows Server 2003
Laptop: Dell Inspiron 9100 (Intel P4 3.2GHz 1MB Prescott, i865pe, 512MB PC3200, Mobility Radeon 9700, DVD+R/DL Burner), Windows XP
Linux: P3 450Mhz, 386MB ram, Slackware 10.1 (Running mySQL/Apache)
TheHeadFL is offline  
 
Closed Thread

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On