(This is the older material - click here - for current specification content)

3. Lexical Analysis (2)

Consider a simple statement like this:

time = 5 + 3; // time to work out a value

To a novice programmer, this statement simply means calculating a value by adding two numbers together and assigning the result to a variable called time.

Many programming environments, such as those for Visual Basic, Delphi and C++, contain their own text editor with which to write your program. When you write your code, what you are actually creating is a sequence of ASCII characters, and these characters together are called 'source code'.
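To see that source code really is just a sequence of character codes, here is a minimal Python sketch. Python is used purely for illustration, and the names source_code and ascii_codes are our own, not part of any compiler.

```python
# The example statement, held as a plain string of characters.
source_code = "time = 5 + 3;"

# ord() returns the numeric code of a single character.
ascii_codes = [ord(ch) for ch in source_code]

# One number is printed for every character, including the spaces.
print(ascii_codes)
```

Notice that the spaces have codes too (32). To the compiler they are characters like any other until lexical analysis deals with them.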

This 'source code' is loaded into the compiler and lexical analysis begins (the first stage of compilation).

First of all, the compiler looks at the incoming character stream and tries to spot where one item (such as a keyword, a name, a number or an operator) ends and the next one starts.

For example, take the characters 'time ='. The compiler does not recognise the characters 'time' as one of its keywords, so it assumes they must be a symbol of some kind, such as a variable or a constant. In computing terms a symbol is a named entity to which a value can be assigned.
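The decision described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the keyword list and the function names read_word and classify_word are hypothetical, and a real compiler would have a much longer keyword list.

```python
# Hypothetical keyword list, for illustration only.
KEYWORDS = {"if", "else", "while", "return"}

def read_word(text, start=0):
    """Collect letters, digits and underscores beginning at position `start`."""
    end = start
    while end < len(text) and (text[end].isalnum() or text[end] == "_"):
        end += 1
    return text[start:end]

def classify_word(word):
    """A word that is not a known keyword is treated as an identifier."""
    return "KEYWORD" if word in KEYWORDS else "IDENT"

word = read_word("time = 5 + 3;")
print(word, classify_word(word))  # prints: time IDENT
```

The scan stops at the first space, so the word collected is 'time'. Because 'time' is not in the keyword list, it is classed as IDENT.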

So now the compiler begins to form a token table in memory:

lexeme	token type
time	IDENT

A 'lexeme' is the term used to describe a specific item that the lexical analysis software has separated from the rest of the incoming character stream (source code). In this case it creates an IDENT type token with the characters 'time' embedded in it.

In this particular compiler, IDENT means a variable or a constant.

(Editor note: Don't worry, you do not need to learn these token types, I am only including them to explain lexical analysis in general).

This token type tells the later stages of compilation (syntax analysis and code generation) to allocate some memory space for this item within the final executable file.

Next, it spots the '=' character and, because it understands what '=' means, tokenises it into its proper token:

lexeme	token type
time	IDENT
=	ASSIGN_OP

Now it spots a number followed by a plus sign and another number, so the token table now looks like this:

lexeme	token type
time	IDENT
=	ASSIGN_OP
5	NUMBER
+	ADD_OP
3	NUMBER
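The token table above can be reproduced with a short Python sketch. This is one possible way of doing it, using regular expressions, and not how any particular compiler works. The token type names IDENT, ASSIGN_OP, NUMBER and ADD_OP come from the table; END_STMT (for the semicolon), TOKEN_SPEC and tokenise are our own names, added for illustration.

```python
import re

# Each entry pairs a token type with a pattern that recognises it.
# END_STMT is our own addition so that the ';' has somewhere to go.
TOKEN_SPEC = [
    ("WHITESPACE", r"\s+"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER", r"\d+"),
    ("ASSIGN_OP", r"="),
    ("ADD_OP", r"\+"),
    ("END_STMT", r";"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenise(source):
    """Return a list of (lexeme, token type) pairs for the given source text."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"Unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "WHITESPACE":  # spaces separate tokens but are not kept
            tokens.append((match.group(), match.lastgroup))
        pos = match.end()
    return tokens

for lexeme, token_type in tokenise("time = 5 + 3;"):
    print(f"{lexeme}\t{token_type}")
```

Running this prints the same five rows as the table, followed by one extra row for the semicolon.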

Spaces surrounding the line and in-between words are called 'whitespace' and are removed by lexical analysis. In addition, if it spots the symbol for comments (often two slashes, //), it will ignore all the characters that follow until the end-of-line character is encountered.
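Here is a minimal Python sketch of that clean-up step for a single line. It is a simplification: the function name is our own, it only trims the whitespace around the line, and it assumes '//' never appears inside a piece of quoted text, which a real lexical analyser would have to check for.

```python
def strip_comment_and_whitespace(line):
    """Drop everything from '//' to the end of the line, then trim the spaces."""
    code, _, _comment = line.partition("//")
    return code.strip()

print(strip_comment_and_whitespace("  time = 5 + 3; // time to work out a value"))
# prints: time = 5 + 3;
```

The whitespace between the words is discarded later, at the point where the individual tokens are formed.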

In this way, lexical analysis breaks the source code down into its component parts, ready for the next stage.

REMEMBER: you do not need to learn these tables.

Challenge: see if you can find out one extra fact on this topic that we haven't already told you.

Click on this link: Lexical Analysis