Levels of syntax
Computer
language syntax is generally distinguished into three levels:
- Words – the lexical level, determining how characters form tokens;
- Phrases – the grammar level, narrowly speaking, determining how tokens form phrases;
- Context – determining what objects or variables names refer to, if types are valid, etc.
Distinguishing
in this way yields modularity, allowing each level to be described and
processed separately, and often independently. First a lexer turns the linear
sequence of characters into a linear sequence of tokens; this is known as "lexical analysis" or
"lexing". Second the parser turns the linear sequence of tokens into
a hierarchical syntax tree; this is known as "parsing" narrowly speaking. Thirdly
the contextual analysis resolves names and checks types. This modularity is
sometimes possible, but in many real-world languages an earlier step depends on
a later step – for example, the lexer hack in C is because tokenization depends
on context. Even in these cases, syntactical analysis is often seen as
approximating this ideal model.
The parsing
stage itself can be divided into two parts: the parse tree or "concrete syntax tree"
which is determined by the grammar, but in general far too detailed for
practical use, and the abstract syntax tree (AST),
which simplifies this into a usable form. The AST and contextual analysis steps
can be considered a form of semantic analysis, as they are adding meaning and
interpretation to the syntax, or alternatively as informal, manual
implementations of syntactical rules that would be difficult or awkward to
describe or implement formally.
The levels
generally correspond to levels in the Chomsky hierarchy. Words are in a regular language, specified
in the lexical grammar, which is a Type-3 grammar,
generally given as regular expressions. Phrases
are in a context-free language (CFL),
generally a deterministic context-free language (DCFL), specified in a phrase
structure grammar, which is a Type-2 grammar, generally given as production
rules in Backus–Naur Form (BNF).
Phrase grammars are often specified in much more constrained grammars than full
context-free grammars, in order
to make them easier to parse; while the LR parser can parse any DCFL in linear time,
the simple LALR parser and even simpler LL parser are more efficient, but can only
parse grammars whose production rules are constrained. Contextual structure can
in principle be described by a context-sensitive
grammar, and automatically analyzed by means such as attribute grammars, though in general this step is
done manually, via name resolution rules and type checking, and implemented via a symbol table which stores names and types for
each scope.
Tools have
been written that automatically generate a lexer from a lexical specification written
in regular expressions and a parser from the phrase grammar written in BNF:
this allows one to use declarative programming, rather
than need to have procedural or functional programming. A notable example is
the lex-yacc pair. These automatically produce a concrete syntax tree; the
parser writer must then manually write code describing how this is converted to
an abstract syntax tree. Contextual analysis is also generally
implemented manually. Despite the existence of these automatic tools, parsing
is often implemented manually, for various reasons – perhaps the phrase
structure is not context-free, or an alternative implementation improves
performance or error-reporting, or allows the grammar to be changed more
easily. Parsers are often written in functional languages, such as Haskell, in
scripting languages, such as Python or Perl, or in C or C++.
Examples of errors
Main article:
Syntax error
As an
example, (add 1 1) is a
syntactically valid Lisp program (assuming the 'add' function exists, else name
resolution fails), adding 1 and 1. However, the following are invalid:
(_ 1 1) lexical error: '_' is not valid
(add 1 1 parsing error: missing closing ')'
(add 1 x) name error: 'x' is not bound
Note that
the lexer is unable to identify the error – all it knows is that, after
producing the token LEFT_PAREN, '(' the remainder of the program is invalid,
since no word rule begins with '_'. At the parsing stage, the parser has
identified the "list" production rule due to the '(' token (as the
only match), and thus can give an error message; in general it may be
ambiguous. At the context stage, the symbol 'x' exists in the syntax tree, but
has not been defined, and thus the context analyzer can give a specific error.
In a
strongly typed language, type errors are also a form of syntax error which is generally
determined at the contextual analysis stage, and this is considered a strength
of strong typing. For example, the following is syntactically invalid Python
code (as these are literals, the type can be determined at parse time):
'a' + 1
…as it adds
a string and an integer. This can be detected at the parsing (phrase analysis)
level if one has separate rules for "string + string" and
"integer + integer", but more commonly this will instead be parsed by
a general rule like "LiteralOrIdentifier + LiteralOrIdentifier" and
then the error will be detected at contextual analysis stage, where type
checking occurs. In some cases this validation is not done, and these syntax
errors are only detected at runtime.
In a weakly
typed language, where type can only be determined at runtime, type errors are
instead a semantic error, and can only be determined at runtime. The following
Python code:
a + b
is
ambiguous, and while syntactically valid at the phrase level, it can only be
validated at runtime, as variables do not have type in Python, only values do.
Syntax definition
Parse tree of Python
code with inset tokenization
The syntax
of textual programming languages is usually defined using a combination of regular expressions (for lexical structure)
and Backus–Naur Form (for grammatical structure)
to inductively specify syntactic categories
(nonterminals) and terminal symbols. Syntactic categories are defined by
rules called productions, which specify the values that belong to a
particular syntactic category.[1] Terminal symbols are the concrete
characters or strings of characters (for example keywords such as define, if, let, or void)
from which syntactically valid programs are constructed.
A language
can have different equivalent grammars, such as equivalent regular expressions
(at the lexical levels), or different phrase rules which generate the same
language. Using a broader category of grammars, such as LR grammars, can allow
shorter or simpler grammars compared with more restricted categories, such as
LL grammar, which may requires longer grammars with more rules. Different but
equivalent phrase grammars yield different parse trees, though the underlying
language (set of valid documents) is the same.
Example: Lisp
Below is a
simple grammar, defined using the notation of regular expressions and
Backus–Naur Form. It describes the syntax of Lisp, which
defines productions for the syntactic categories expression, atom,
number, symbol, and list:
expression ::= atom | list
atom ::= number | symbol
number ::= [+-]?['0'-'9']+
symbol ::= ['A'-'Z''a'-'z'].*
list ::= '(' expression* ')'
This grammar
specifies the following:
- an expression is either an atom or a list;
- an atom is either a number or a symbol;
- a number is an unbroken sequence of one or more decimal digits, optionally preceded by a plus or minus sign;
- a symbol is a letter followed by zero or more of any characters (excluding whitespace); and
- a list is a matched pair of parentheses, with zero or more expressions inside it.
Here the
decimal digits, upper- and lower-case characters, and parentheses are terminal
symbols.
The
following are examples of well-formed token sequences in this grammar: '12345', '()', '(a b c232
(1))'
Complex grammars
The grammar
needed to specify a programming language can be classified by its position in
the Chomsky hierarchy. The phrase grammar of most
programming languages can be specified using a Type-2 grammar, i.e., they are context-free grammars,[2] though the overall syntax is
context-sensitive (due to variable declarations and nested scopes), hence
Type-1. However, there are exceptions, and for some languages the phrase
grammar is Type-0 (Turing-complete).
In some
languages like Perl and Lisp the specification (or implementation) of the
language allows constructs that execute during the parsing phase. Furthermore,
these languages have constructs that allow the programmer to alter the behavior
of the parser. This combination effectively blurs the distinction between
parsing and execution, and makes syntax analysis an undecidable problem in these
languages, meaning that the parsing phase may not finish. For example, in Perl
it is possible to execute code during parsing using a BEGIN statement,
and Perl function prototypes may alter the syntactic interpretation, and
possibly even the syntactic validity of the remaining code.[3] Colloquially this is referred to as
"only Perl can parse Perl" (because code must be executed during
parsing, and can modify the grammar), or more strongly "even Perl cannot
parse Perl" (because it is undecidable). Similarly, Lisp macros introduced
by the defmacro syntax also
execute during parsing, meaning that a Lisp compiler must have an entire Lisp
run-time system present. In contrast C macros are merely string replacements,
and do not require code execution.[4][5]
Syntax versus semantics
The syntax
of a language describes the form of a valid program, but does not provide any
information about the meaning of the program or the results of executing that
program. The meaning given to a combination of symbols is handled by semantics
(either formal or hard-coded in a reference implementation). Not all syntactically correct programs are
semantically correct. Many syntactically correct programs are nonetheless
ill-formed, per the language's rules; and may (depending on the language
specification and the soundness of the implementation) result in an error on
translation or execution. In some cases, such programs may exhibit undefined behavior. Even when
a program is well-defined within a language, it may still have a meaning that
is not intended by the person who wrote it.
Using natural language as an
example, it may not be possible to assign a meaning to a grammatically correct
sentence or the sentence may be false:
- "Colorless green ideas sleep furiously." is grammatically well formed but has no generally accepted meaning.
- "John is a married bachelor." is grammatically well formed but expresses a meaning that cannot be true.
The
following C language fragment is syntactically correct, but performs an
operation that is not semantically defined (because p is a null pointer, the operations p->real and p->im have no
meaning):
complex *p = NULL;
complex abs_p = sqrt (p->real * p->real
+ p->im * p->im);
More simply:
int x;
printf("%d", x);
Syntax is the body of rules that govern sentence
structure and formation in a language. A sentence is usually defined as a
grammatical unit that is composed of one or more clauses that expresses a
thought. It also may be composed of elliptical constructions, a single word or
short phrase that has meaning within a given context.
Syntax primarily deals with phrases, their structure and how they are assembled together. The basic building block phrases are noun phrases, which have a noun as a head word, and verb phrases, which have a verb as a head word. These may contain adjectival and adverbial units and prepositional phrases, as well as dependent, subordinate phrases.
Here is an example sentence:
'The clever young girl adeptly tripped the burglar in the hallway before he could turn on the light.'
The noun phrase is 'the clever young girl,' while the verb phrase is 'adeptly tripped the burglar'. The adjectival unit is contained within the noun phrase 'clever young'. The adverbial unit is contained within the verb phrase 'adeptly'. The prepositional phrase is 'in the hallway'. The subordinate clause is 'before he could turn on the light'.
Syntax primarily deals with phrases, their structure and how they are assembled together. The basic building block phrases are noun phrases, which have a noun as a head word, and verb phrases, which have a verb as a head word. These may contain adjectival and adverbial units and prepositional phrases, as well as dependent, subordinate phrases.
Here is an example sentence:
'The clever young girl adeptly tripped the burglar in the hallway before he could turn on the light.'
The noun phrase is 'the clever young girl,' while the verb phrase is 'adeptly tripped the burglar'. The adjectival unit is contained within the noun phrase 'clever young'. The adverbial unit is contained within the verb phrase 'adeptly'. The prepositional phrase is 'in the hallway'. The subordinate clause is 'before he could turn on the light'.