SANDBOX-0013 html parser skeleton

    ID: SANDBOX-0013
    STATUS: 30% complete (initial prototype)
    AUTHOR: Anna Ceguerra
    DATE: 19 January 2012
    Copyright © 2012 Anna Ceguerra
    This is an example HTML parser. It currently runs only on Picaso, because it depends on the native FAT16 file functions of Picaso chips.
    There are two versions: a simple parser that handles only 4 keywords, and a fuller version that contains only skeletons of the keyword handler functions.
    The simple parser is under 300 lines of code because it is event-driven. An event-driven parser generates each event from what it has only just seen, so its state machine is quite simple, which is why the code is so compact.
    The parser contains functions to recognise escape characters, a basic tokeniser, and a function to read attribute values. It also contains the handlers for each keyword. The main "parsehtml" function contains the main loop that reads each event.
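    To illustrate the main loop and the keyword handlers described above, here is a minimal sketch in C. It is not the attached 4DGL code. The Event and State types, the EV_* constants, the handler names and the pre-built event array (standing in for the tokeniser reading a file) are all assumptions made for this example; only the name "parsehtml" comes from the post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical event kinds; the attached parser may use different ones. */
enum { EV_END, EV_KEYWORD, EV_ATTRIBUTE, EV_TEXT };

typedef struct { int kind; const char *text; } Event;

/* Hypothetical parser state that the handlers update. */
typedef struct {
    int body_open;       /* 1 while inside <body> ... </body> */
    size_t text_chars;   /* number of text characters seen */
    char bgcolor[16];    /* last bgcolor attribute value seen */
} State;

typedef void (*KeywordHandler)(State *st);

static void on_body(State *st)       { st->body_open = 1; }
static void on_body_close(State *st) { st->body_open = 0; }

/* One table entry per keyword: a new keyword only needs a new row. */
static const struct { const char *keyword; KeywordHandler handler; } handlers[] = {
    { "body",  on_body },
    { "/body", on_body_close },
};

/* Main loop: read each event and react to it immediately.
   Returns the number of keywords that had a handler. */
int parsehtml(const Event *ev, State *st)
{
    assert(ev != NULL && st != NULL);
    int handled = 0;
    for (; ev->kind != EV_END; ev++) {
        switch (ev->kind) {
        case EV_KEYWORD:
            for (size_t i = 0; i < sizeof handlers / sizeof handlers[0]; i++) {
                if (strcmp(ev->text, handlers[i].keyword) == 0) {
                    handlers[i].handler(st);
                    handled++;
                    break;              /* leaves the lookup loop only */
                }
            }
            break;
        case EV_ATTRIBUTE:
            if (strncmp(ev->text, "bgcolor=", 8) == 0) {
                strncpy(st->bgcolor, ev->text + 8, sizeof st->bgcolor - 1);
                st->bgcolor[sizeof st->bgcolor - 1] = '\0';
            }
            break;
        case EV_TEXT:
            st->text_chars += strlen(ev->text);
            break;
        default:
            break;                      /* unknown events are ignored */
        }
    }
    return handled;
}
```

    Because each event is handled as soon as it is read, nothing has to be stored beyond the small State struct, which suits a chip with little RAM.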
    The tokeniser first tries to match the escape characters, then the keyword. If it finds a "<", it sets tokeniser.f to the BEGIN_TAG constant, and so on for the other token types. If the previous token was a keyword, then tokeniser.f is set to SET_ATTRIBUTE.
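    The flag logic above can be sketched in C as follows. This is an illustration, not the attached 4DGL code: BEGIN_TAG and SET_ATTRIBUTE are the names used in the post, while END_OF_INPUT, END_TAG, KEYWORD, TEXT, the struct layout and the function names are assumptions. The sketch reads from a string in memory instead of a FAT16 file, leaves out escape-character matching, and also treats a word that follows an attribute as another attribute.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* BEGIN_TAG and SET_ATTRIBUTE appear in the post; the rest are hypothetical. */
enum { END_OF_INPUT, BEGIN_TAG, END_TAG, KEYWORD, SET_ATTRIBUTE, TEXT };

typedef struct {
    const char *src;  /* input text (the original reads from a file) */
    size_t pos;       /* current read position in src */
    int in_tag;       /* 1 between '<' and '>' */
    int f;            /* kind of the token just read */
    int prev;         /* kind of the token before that */
    char text[32];    /* characters of the token just read, truncated to fit */
} Tokeniser;

void tokeniser_init(Tokeniser *t, const char *src)
{
    assert(t != NULL && src != NULL);
    memset(t, 0, sizeof *t);
    t->src = src;
    t->f = END_OF_INPUT;
}

/* Reads one token, sets t->f and t->text, and returns t->f. */
int next_token(Tokeniser *t)
{
    assert(t != NULL && t->src != NULL);
    size_t n = 0;
    t->prev = t->f;
    t->text[0] = '\0';

    if (t->in_tag) {                       /* spaces only separate words in a tag */
        while (isspace((unsigned char)t->src[t->pos])) t->pos++;
    }
    char c = t->src[t->pos];
    if (c == '\0') {
        t->f = END_OF_INPUT;
    } else if (c == '<') {
        t->pos++;
        t->in_tag = 1;
        t->f = BEGIN_TAG;
    } else if (c == '>' && t->in_tag) {
        t->pos++;
        t->in_tag = 0;
        t->f = END_TAG;
    } else if (t->in_tag) {                /* a word inside a tag */
        while ((c = t->src[t->pos]) != '\0' && c != '<' && c != '>' &&
               !isspace((unsigned char)c)) {
            if (n + 1 < sizeof t->text) t->text[n++] = c;
            t->pos++;
        }
        t->text[n] = '\0';
        /* After a keyword (or, by assumption, another attribute) the word
           is an attribute; directly after '<' it is the keyword. */
        t->f = (t->prev == KEYWORD || t->prev == SET_ATTRIBUTE)
                   ? SET_ATTRIBUTE : KEYWORD;
    } else {                               /* plain text between tags */
        while ((c = t->src[t->pos]) != '\0' && c != '<') {
            if (n + 1 < sizeof t->text) t->text[n++] = c;
            t->pos++;
        }
        t->text[n] = '\0';
        t->f = TEXT;
    }
    return t->f;
}
```

    Calling next_token repeatedly on <body bgcolor=white>blabla</body> yields BEGIN_TAG, KEYWORD ("body"), SET_ATTRIBUTE ("bgcolor=white"), END_TAG, TEXT ("blabla"), BEGIN_TAG, KEYWORD ("/body"), END_TAG and finally END_OF_INPUT.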

    Attached are
    1. simple parser
    2. full skeleton parser, handles more keywords

    The example HTML file on which the simple and full-skeleton parsers were tested looks like this:
    <body bgcolor=white>blabla</body>
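
    For the bgcolor=white attribute in this example, reading the attribute value might look like the C sketch below. The function name read_attribute, its signature and the quote handling are assumptions made for illustration; the attached 4DGL function may work differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits a "name=value" token into name and value, stripping one pair of
   matching quotes around the value. Results are truncated to fit the
   buffers. Returns 1 on success, 0 if the token contains no '='. */
int read_attribute(const char *token,
                   char *name, size_t name_size,
                   char *value, size_t value_size)
{
    assert(token != NULL && name != NULL && value != NULL);
    assert(name_size > 0 && value_size > 0);

    const char *eq = strchr(token, '=');
    if (eq == NULL) return 0;

    size_t n = (size_t)(eq - token);
    if (n >= name_size) n = name_size - 1;
    memcpy(name, token, n);
    name[n] = '\0';

    const char *v = eq + 1;
    size_t len = strlen(v);
    if (len >= 2 && (v[0] == '"' || v[0] == '\'') && v[len - 1] == v[0]) {
        v++;            /* skip the opening quote */
        len -= 2;       /* drop both quotes from the length */
    }
    if (len >= value_size) len = value_size - 1;
    memcpy(value, v, len);
    value[len] = '\0';
    return 1;
}
```

    With this sketch, both bgcolor=white and bgcolor="white" give the name "bgcolor" and the value "white".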

    NOTE: For the full-featured parser you will probably need a stack to keep track of the attribute values for nested tags.
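
    One way such a stack could look is sketched below in C. Everything here is an assumption for illustration (the AttrStack and Frame types, the function names, the fixed depth of 8, and tracking only bgcolor): an opening tag pushes a frame that inherits the enclosing value unless it sets its own, and the matching closing tag pops it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 8   /* hypothetical limit on tag nesting */

typedef struct { char keyword[16]; char bgcolor[16]; } Frame;
typedef struct { Frame frames[MAX_DEPTH]; int depth; } AttrStack;

static void copy_str(char *dst, size_t size, const char *src)
{
    strncpy(dst, src, size - 1);
    dst[size - 1] = '\0';
}

/* Attribute value in effect for the innermost open tag ("" if none). */
const char *attr_current_bgcolor(const AttrStack *s)
{
    assert(s != NULL);
    return s->depth > 0 ? s->frames[s->depth - 1].bgcolor : "";
}

/* Opening tag: push a frame. Pass bgcolor = NULL to inherit the value
   of the enclosing tag. Returns 0 on success, -1 if the stack is full. */
int attr_push(AttrStack *s, const char *keyword, const char *bgcolor)
{
    assert(s != NULL && keyword != NULL);
    if (s->depth >= MAX_DEPTH) return -1;
    Frame *f = &s->frames[s->depth];
    copy_str(f->keyword, sizeof f->keyword, keyword);
    copy_str(f->bgcolor, sizeof f->bgcolor,
             bgcolor != NULL ? bgcolor : attr_current_bgcolor(s));
    s->depth++;
    return 0;
}

/* Closing tag: pop the frame if it matches the innermost open tag.
   Returns 0 on success, -1 on an empty stack or a mismatched tag. */
int attr_pop(AttrStack *s, const char *keyword)
{
    assert(s != NULL && keyword != NULL);
    if (s->depth == 0) return -1;
    if (strcmp(s->frames[s->depth - 1].keyword, keyword) != 0) return -1;
    s->depth--;
    return 0;
}
```

    For example, after pushing "body" with "white" and then a nested tag with "red", the current value is "red"; popping the nested tag restores "white" without the parser having to re-read the outer tag.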