Parser

The parser pipeline is split into a lexer, control word dispatch, and explicit parser state. The lexer emits structural tokens, while parser state owns style stacks, destinations, tables, lists, images, metadata, notes, diagnostics, and output buffering.

Use parse_rtf() for bytes or strings and read_rtf() for files.

rtfstruct.parse_rtf(data, options=None)[source]

Parse RTF data into a structured document AST.

Parameters:
  • data (bytes | str) – RTF input as bytes or text. Bytes are decoded as Latin-1 for the skeleton reader so source bytes are preserved one-to-one.

  • options (ParserOptions | None) – Optional parser configuration.

Returns:

A Document AST containing parsed blocks and diagnostics.

Raises:

RtfSyntaxError – Raised when input is not recognisably RTF and recovery is disabled.

Return type:

Document
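
The one-to-one byte preservation described for data follows from Latin-1 itself: each byte value from 0 to 255 maps to the code point with the same number, so decoding never fails and encoding restores the input exactly. A minimal standard-library check is sketched below; the helper name latin1_round_trip is hypothetical and not part of rtfstruct.

```python
def latin1_round_trip(raw: bytes) -> bytes:
    """Decode bytes as Latin-1, then encode them back unchanged."""
    text = raw.decode("latin-1")    # cannot fail: every byte is a valid code point
    assert len(text) == len(raw)    # exactly one character per source byte
    return text.encode("latin-1")


all_bytes = bytes(range(256))
round_tripped = latin1_round_trip(all_bytes)
```

Because of this property, source offsets measured in characters of the decoded text coincide with byte offsets in the original input.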

rtfstruct.read_rtf(path, options=None)[source]

Read an RTF file and parse it into a structured document AST.

Parameters:
  • path (str | Path) – File path to read.

  • options (ParserOptions | None) – Optional parser configuration.

Returns:

Parsed Document AST.

Return type:

Document

class rtfstruct.ParserOptions(recover=True, preserve_unknown_destinations=False, extract_images=True, track_spans=False, max_group_depth=1000, max_document_chars=100000000, max_diagnostics=10000)[source]

Parser configuration.

Parameters:
  • recover (bool)

  • preserve_unknown_destinations (bool)

  • extract_images (bool)

  • track_spans (bool)

  • max_group_depth (int)

  • max_document_chars (int)

  • max_diagnostics (int)

recover

Whether recoverable malformed input should produce diagnostics instead of raising.

Type:

bool

preserve_unknown_destinations

Whether readable unknown destinations may be preserved later as raw AST payloads.

Type:

bool

extract_images

Whether image payload extraction is requested.

Type:

bool

track_spans

Whether source spans should be attached where practical.

Type:

bool

max_group_depth

Maximum allowed RTF group nesting depth.

Type:

int

max_document_chars

Maximum emitted document characters.

Type:

int

max_diagnostics

Maximum diagnostics retained on the document.

Type:

int
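
To illustrate what a limit such as max_group_depth guards against, the sketch below measures brace nesting iteratively. It is a standalone illustration only: the function names are hypothetical, and how rtfstruct itself enforces the limit is not shown in this reference.

```python
def max_nesting_depth(text: str) -> int:
    """Return the deepest brace nesting, ignoring backslash-escaped characters."""
    depth = deepest = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2                      # skip the escaped character (covers \{, \} and \\)
            continue
        if ch == "{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "}":
            depth = max(depth - 1, 0)   # tolerate an unbalanced closing brace
        i += 1
    return deepest


def exceeds_group_depth(text: str, max_group_depth: int = 1000) -> bool:
    """True when the input nests deeper than the configured limit."""
    return max_nesting_depth(text) > max_group_depth
```

The loop uses no recursion, so a pathological input with very deep nesting cannot exhaust the Python call stack.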

Internal Pipeline

RTF lexer.

This module converts RTF text into a stream of tokens with source offsets. It does not interpret control word semantics or build AST nodes. Semantic handling belongs to reader.py and later control_words.py.

class rtfstruct.lexer.RtfLexer(text)[source]

Iterative lexer for RTF input.

The lexer avoids recursion and yields group, control, hex, text, and EOF tokens. It intentionally leaves malformed semantic recovery to the parser, while keeping tokenization predictable for large inputs.

Parameters:

text (str)
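
The token categories listed above (group, control, hex, text, and EOF) can be pictured with a small iterative tokenizer. This is a simplified sketch, not the RtfLexer implementation: the SketchToken type, its field names, and the kind strings are assumptions made for illustration.

```python
from typing import Iterator, NamedTuple, Optional

DIGITS = "0123456789"


class SketchToken(NamedTuple):
    kind: str              # group_start, group_end, control_word, control_symbol, hex, text, eof
    text: str
    param: Optional[int]   # numeric parameter of a control word, if any
    start: int             # source offsets; end is exclusive
    end: int


def sketch_tokens(src: str) -> Iterator[SketchToken]:
    i, n = 0, len(src)
    while i < n:
        start, ch = i, src[i]
        if ch in "{}":
            i += 1
            kind = "group_start" if ch == "{" else "group_end"
            yield SketchToken(kind, ch, None, start, i)
        elif ch == "\\" and i + 1 < n:
            nxt = src[i + 1]
            if nxt == "'":                               # hex escape: \'hh
                i = min(i + 4, n)
                yield SketchToken("hex", src[start + 2:i], None, start, i)
            elif nxt.isascii() and nxt.isalpha():        # control word: letters, optional number
                j = i + 1
                while j < n and src[j].isascii() and src[j].isalpha():
                    j += 1
                k = j
                if src[j:j + 1] == "-" and j + 1 < n and src[j + 1] in DIGITS:
                    k = j + 1
                while k < n and src[k] in DIGITS:
                    k += 1
                param = int(src[j:k]) if k > j else None
                if k < n and src[k] == " ":              # one delimiter space is consumed
                    k += 1
                i = k
                yield SketchToken("control_word", src[start + 1:j], param, start, i)
            else:                                        # control symbol such as \~ or \{
                i += 2
                yield SketchToken("control_symbol", nxt, None, start, i)
        else:                                            # text runs to the next brace or backslash
            i += 1
            while i < n and src[i] not in "{}\\":
                i += 1
            yield SketchToken("text", src[start:i], None, start, i)
    yield SketchToken("eof", "", None, n, n)
```

The sketch assigns no meaning to any control word; it only records the word, its parameter, and its offsets, which mirrors the division of labour described for this module.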

Mutable parser state for the RTF reader.

This module owns the compact state object used by the parser state machine. It does not tokenize input or decide control-word semantics; lexer tokens come from lexer.py, and semantic dispatch lives in control_words.py.

class rtfstruct.parser_state.AnnotationContext(id, blocks, parent_output)[source]

Parser-local state for an active annotation destination.

Parameters:
  • id

  • blocks

  • parent_output

class rtfstruct.parser_state.FieldContext(result_start_index, instruction_parts=<factory>)[source]

Parser-local state for an active RTF field.

Parameters:
  • result_start_index (int)

  • instruction_parts (list[str])

property instruction: str

Return the accumulated field instruction.

class rtfstruct.parser_state.FootnoteContext(id, blocks, parent_output)[source]

Parser-local state for an active footnote destination.

Parameters:
  • id

  • blocks

  • parent_output

class rtfstruct.parser_state.ImageContext(id, content_type=None, width_twips=None, height_twips=None, goal_width_twips=None, goal_height_twips=None, scale_x=None, scale_y=None, hex_parts=<factory>)[source]

Parser-local state for an active RTF picture destination.

Parameters:
  • id (str)

  • content_type (str | None)

  • width_twips (int | None)

  • height_twips (int | None)

  • goal_width_twips (int | None)

  • goal_height_twips (int | None)

  • scale_x (int | None)

  • scale_y (int | None)

  • hex_parts (list[str])
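
The twips fields and hex_parts buffer can be related to familiar units with a few lines of arithmetic. The helpers below are hypothetical and only sketch the conversions; they assume that a twip is 1/20 of a point and that scale values are percentages, as in the RTF picture properties.

```python
from typing import Optional, Sequence


def twips_to_points(twips: int) -> float:
    """Convert twips to points (20 twips per point, 1440 per inch)."""
    return twips / 20


def image_bytes(hex_parts: Sequence[str]) -> bytes:
    """Join buffered hex chunks and decode them to raw bytes."""
    # bytes.fromhex skips ASCII whitespace, so line breaks inside chunks are harmless.
    return bytes.fromhex("".join(hex_parts))


def scaled_width_twips(goal_width_twips: int, scale_x: Optional[int]) -> float:
    """Apply a percentage scale to a goal width; None is treated as 100 percent."""
    percent = 100 if scale_x is None else scale_x
    return goal_width_twips * percent / 100
```

Buffering chunks in a list and joining once keeps payload collection linear in the size of the image data.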

class rtfstruct.parser_state.OutputContext(current_inlines, text_parts, paragraph_style_for_current, paragraph_span_start, paragraph_span_end, text_span_start, text_span_end, active_blocks, force_text_run_boundary)[source]

Parser output buffers to restore after nested destinations.

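Parameters:
  • current_inlines

  • text_parts

  • paragraph_style_for_current

  • paragraph_span_start

  • paragraph_span_end

  • text_span_start

  • text_span_end

  • active_blocks

  • force_text_run_boundary
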
class rtfstruct.parser_state.ParserState(options, diagnostics, style_interner=<factory>, current_style=<factory>, style_stack=<factory>, current_paragraph_style=<factory>, paragraph_style_stack=<factory>, paragraph_style_for_current=None, blocks=<factory>, footnotes=<factory>, footnote_stack=<factory>, next_footnote_number=1, annotations=<factory>, annotation_stack=<factory>, next_annotation_number=1, images=<factory>, image_stack=<factory>, next_image_number=1, metadata=<factory>, current_metadata_key=None, current_metadata_parts=<factory>, table_builder=None, table_parent_output=None, table_row_active=False, list_definitions_ordered=<factory>, list_overrides=<factory>, current_list_id=None, current_list_ordered=False, current_override_list_id=None, current_override_number=None, current_inlines=<factory>, text_parts=<factory>, text_span_start=None, text_span_end=None, paragraph_span_start=None, paragraph_span_end=None, skip_depth=0, current_destination=Destination.NORMAL, destination_stack=<factory>, ansi_codepage=1252, font_table=<factory>, font_charsets=<factory>, color_table=<factory>, current_font_id=None, current_font_charset=None, active_font_charset=None, current_font_name_parts=<factory>, current_color_red=None, current_color_green=None, current_color_blue=None, field_stack=<factory>, force_text_run_boundary=False, unicode_skip_bytes=1, fallback_chars_to_skip=0, emitted_chars=0)[source]

Mutable state for the initial RTF parser state machine.

Parameters:
  • options (ParserOptions)

  • diagnostics (Diagnostics)

  • style_interner (TextStyleInterner)

  • current_style (TextStyle)

  • style_stack (list[TextStyle])

  • current_paragraph_style (ParagraphStyle)

  • paragraph_style_stack (list[ParagraphStyle])

  • paragraph_style_for_current (ParagraphStyle | None)

  • blocks (list[Paragraph | ListBlock | Table])

  • footnotes (dict[str, Footnote])

  • footnote_stack (list[FootnoteContext])

  • next_footnote_number (int)

  • annotations (dict[str, Annotation])

  • annotation_stack (list[AnnotationContext])

  • next_annotation_number (int)

  • images (dict[str, Image])

  • image_stack (list[ImageContext])

  • next_image_number (int)

  • metadata (Metadata)

  • current_metadata_key (str | None)

  • current_metadata_parts (list[str])

  • table_builder (TableBuilder | None)

  • table_parent_output (OutputContext | None)

  • table_row_active (bool)

  • list_definitions_ordered (dict[int, bool])

  • list_overrides (dict[int, int])

  • current_list_id (int | None)

  • current_list_ordered (bool)

  • current_override_list_id (int | None)

  • current_override_number (int | None)

  • current_inlines (list[TextRun | Link | Field | FootnoteRef | AnnotationRef | ImageInline | LineBreak | Tab])

  • text_parts (list[str])

  • text_span_start (int | None)

  • text_span_end (int | None)

  • paragraph_span_start (int | None)

  • paragraph_span_end (int | None)

  • skip_depth (int)

  • current_destination (Destination)

  • destination_stack (list[Destination])

  • ansi_codepage (int)

  • font_table (dict[int, str])

  • font_charsets (dict[int, int])

  • color_table (list[Color | None])

  • current_font_id (int | None)

  • current_font_charset (int | None)

  • active_font_charset (int | None)

  • current_font_name_parts (list[str])

  • current_color_red (int | None)

  • current_color_green (int | None)

  • current_color_blue (int | None)

  • field_stack (list[FieldContext])

  • force_text_run_boundary (bool)

  • unicode_skip_bytes (int)

  • fallback_chars_to_skip (int)

  • emitted_chars (int)

options

Parser configuration.

Type:

rtfstruct.options.ParserOptions

diagnostics

Capped diagnostic collector.

Type:

rtfstruct.diagnostics.Diagnostics

style_interner

Shared TextStyle cache.

Type:

rtfstruct.ast.TextStyleInterner

current_style

Active inline style.

Type:

rtfstruct.ast.TextStyle

style_stack

Group-scoped style stack.

Type:

list[rtfstruct.ast.TextStyle]

current_paragraph_style

Active paragraph style.

Type:

rtfstruct.ast.ParagraphStyle

paragraph_style_stack

Group-scoped paragraph style stack.

Type:

list[rtfstruct.ast.ParagraphStyle]

paragraph_style_for_current

Paragraph style snapshot for active content.

Type:

rtfstruct.ast.ParagraphStyle | None

blocks

Emitted top-level blocks (paragraphs, lists, and tables).

Type:

list[rtfstruct.ast.Paragraph | rtfstruct.ast.ListBlock | rtfstruct.ast.Table]

active_blocks

Blocks receiving finished paragraphs for the active output.

Type:

list[rtfstruct.ast.Paragraph | rtfstruct.ast.ListBlock | rtfstruct.ast.Table]

current_inlines

Inline nodes for the active paragraph.

Type:

list[rtfstruct.ast.TextRun | rtfstruct.ast.Link | rtfstruct.ast.Field | rtfstruct.ast.FootnoteRef | rtfstruct.ast.AnnotationRef | rtfstruct.ast.ImageInline | rtfstruct.ast.LineBreak | rtfstruct.ast.Tab]

text_parts

Buffered plain text for the active run.

Type:

list[str]

skip_depth

Current unsupported destination skip depth.

Type:

int

current_destination

Active RTF destination.

Type:

rtfstruct.destinations.Destination

destination_stack

Group-scoped destination stack.

Type:

list[rtfstruct.destinations.Destination]

font_table

Parsed font table keyed by RTF font number.

Type:

dict[int, str]

color_table

Parsed color table indexed by RTF color number.

Type:

list[rtfstruct.ast.Color | None]

field_stack

Active field contexts.

Type:

list[rtfstruct.parser_state.FieldContext]

unicode_skip_bytes

Current \ucN fallback length.

Type:

int

fallback_chars_to_skip

Remaining fallback characters after \uN.

Type:

int

emitted_chars

Count of emitted document characters for safety limits.

Type:

int

add_image_hex_payload(hex_text)[source]

Append image payload hex text.

Parameters:

hex_text (str)

Return type:

None

add_inline(inline)[source]

Flush pending text and append a non-text inline node.

Parameters:

inline (TextRun | Link | Field | FootnoteRef | AnnotationRef | ImageInline | LineBreak | Tab)

Return type:

None

add_table_cell_boundary(boundary_twips)[source]

Record a table cell right-edge boundary.

Parameters:

boundary_twips (int)

Return type:

None
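
Because each recorded boundary is a right edge, cell widths are the differences between consecutive boundaries. A minimal sketch follows, assuming the row starts at a left edge of 0 twips; the function name and the left_edge_twips parameter are hypothetical.

```python
def cell_widths_twips(boundaries_twips, left_edge_twips=0):
    """Turn right-edge boundaries into per-cell widths, all in twips."""
    widths = []
    previous = left_edge_twips
    for boundary in boundaries_twips:
        widths.append(boundary - previous)   # width = this right edge - previous right edge
        previous = boundary
    return widths
```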

add_text(text, start=None, end=None)[source]

Append text after applying Unicode fallback skipping.

Parameters:
  • text (str) – Text to append to the active text buffer.

  • start (int | None) – Optional source start offset.

  • end (int | None) – Optional source end offset.

Return type:

None
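
The fallback skipping applied by add_text can be sketched with the two counters documented above. The class below is a standalone illustration, not ParserState itself: it assumes that every Unicode character re-arms the skip counter with the current \ucN value and that fallback characters may arrive split across several text chunks. The add_unicode method is a hypothetical stand-in for the control-word handler.

```python
class FallbackSkipSketch:
    """Minimal model of Unicode fallback skipping."""

    def __init__(self):
        self.unicode_skip_bytes = 1        # current \ucN value
        self.fallback_chars_to_skip = 0    # remaining fallback characters to drop
        self.text_parts = []

    def add_unicode(self, char):
        """Emit a character from \\uN and arm the skip counter."""
        self.text_parts.append(char)
        self.fallback_chars_to_skip = self.unicode_skip_bytes

    def add_text(self, text):
        """Append text after dropping any pending fallback characters."""
        skip = min(self.fallback_chars_to_skip, len(text))
        self.fallback_chars_to_skip -= skip
        remainder = text[skip:]
        if remainder:
            self.text_parts.append(remainder)
```

Carrying the counter across calls matters because the lexer may deliver the fallback characters in a different token from the text that follows them.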

apply_background_color(index, control_word)[source]

Apply a color-table entry as the current background/highlight color.

Parameters:
  • index (int)

  • control_word (str)

Return type:

None

apply_foreground_color(index)[source]

Apply a color-table entry as the current foreground color.

Parameters:

index (int)

Return type:

None

decode_hex_char(hex_text, *, offset)[source]

Decode one hex escape using the active codepage.

Parameters:
  • hex_text (str)

  • offset (int)

Return type:

str | None
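
The str | None return type suggests that an escape which cannot be decoded yields None rather than an exception. A hedged sketch of that behaviour using Python's built-in codecs is shown below; the function name is hypothetical, diagnostics are omitted, and only codepages that Python exposes under a cpNNNN codec name are handled.

```python
from typing import Optional


def decode_hex_sketch(hex_text: str, codepage: int = 1252) -> Optional[str]:
    """Decode a two-digit hex escape with a Windows codepage, or return None."""
    try:
        byte = bytes([int(hex_text, 16)])       # ValueError for non-hex or out-of-range input
        return byte.decode(f"cp{codepage}")     # UnicodeDecodeError is a ValueError subclass
    except (ValueError, LookupError):           # LookupError: codec name unknown to Python
        return None
```

For example, the byte 0xE9 decodes to "é" under codepage 1252, while 0x81 has no mapping in Python's cp1252 codec and produces None in this sketch.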

finalize_destination()[source]

Finalize destination-local buffered state before leaving a group.

Return type:

None

finalize_open_contexts()[source]

Recover open destinations at EOF and restore output targets.

Return type:

None

finalize_pending_table()[source]

Emit a pending table after its rows have been collected.

Return type:

None

finish_paragraph()[source]

Finish the active paragraph if it has content.

Return type:

None

finish_table_cell()[source]

Finish the active table cell.

Return type:

None

finish_table_row()[source]

Finish the active table row and restore parent output buffers.

Return type:

None

flush_text()[source]

Flush buffered text to the active paragraph.

Return type:

None

list_ordering_by_override()[source]

Return ordered/unordered status keyed by paragraph list identity.

Return type:

dict[int, bool]
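
One plausible way to build this mapping is to join the two list tables held in parser state. The sketch below assumes that list_overrides maps an override number to a list-definition identifier and that an override pointing at an unknown definition counts as unordered; both are assumptions, and the function name is hypothetical.

```python
def ordering_by_override(list_definitions_ordered, list_overrides):
    """Map each override number to the ordered flag of its list definition."""
    return {
        override_number: list_definitions_ordered.get(list_id, False)
        for override_number, list_id in list_overrides.items()
    }
```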

mark_table_horizontal_merge_continuation()[source]

Mark the next table cell as horizontally merged with the left cell.

Return type:

None

mark_table_horizontal_merge_start()[source]

Mark the next table cell as a horizontal merge anchor.

Return type:

None

mark_table_vertical_merge_continuation()[source]

Mark the next table cell as vertically merged with the cell above.

Return type:

None

mark_table_vertical_merge_start()[source]

Mark the next table cell as a vertical merge anchor.

Return type:

None

pop_group(token)[source]

Pop group-scoped state.

Parameters:

token (Token) – Group-end token that caused the pop.

Return type:

None

push_group(token)[source]

Push group-scoped state.

Parameters:

token (Token) – Group-start token that caused the push.

Return type:

None
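
Group-scoped state behaves like a stack of snapshots: entering a group saves the current formatting, and leaving it restores whatever was active before. The sketch below models only inline style with a plain dict; the class is hypothetical and omits the token argument, destinations, and paragraph styles that the real methods handle.

```python
class GroupScopeSketch:
    """Minimal model of push_group/pop_group for inline style."""

    def __init__(self):
        self.current_style = {"bold": False, "italic": False}
        self.style_stack = []

    def push_group(self):
        # Save a snapshot so changes inside the group do not leak out.
        self.style_stack.append(dict(self.current_style))

    def pop_group(self):
        # An unbalanced closing brace is ignored in this sketch.
        if self.style_stack:
            self.current_style = self.style_stack.pop()

    def set_style(self, **changes):
        self.current_style = {**self.current_style, **changes}
```

With this model, the input {\b bold} plain sets bold inside the group only, and the text after the closing brace returns to the outer style.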

reset_paragraph_style()[source]

Reset active paragraph formatting for subsequent text.

Return type:

None

reset_style()[source]

Reset the current inline style to the default style.

Return type:

None

set_ansi_codepage(codepage)[source]

Set the document ANSI codepage used for hex decoding.

Parameters:

codepage (int)

Return type:

None

set_color_component(component, value)[source]

Set a component of the current color-table entry.

Parameters:
  • component (str)

  • value (int)

Return type:

None

set_current_font_charset(charset)[source]

Set the charset for the active font-table entry.

Parameters:

charset (int)

Return type:

None

set_current_font_id(font_id)[source]

Set the active font number for table parsing or text styling.

Parameters:

font_id (int)

Return type:

None

set_current_list_id(list_id)[source]

Set the active list-definition identifier.

Parameters:

list_id (int)

Return type:

None

set_current_list_level_kind(level_kind)[source]

Record whether a list level is ordered.

RTF \levelnfc23 is the common bullet marker. Numbering formats such as decimal, roman, and alphabetic are treated as ordered for this initial pass.

Parameters:

level_kind (int)

Return type:

None
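
The classification described above reduces to a single comparison. The sketch below uses hypothetical names and treats every level kind other than the bullet value 23 as ordered, which matches the description for this initial pass but may be refined later.

```python
BULLET_LEVEL_KIND = 23   # the \levelnfc value for the common bullet marker


def is_ordered_level(level_kind: int) -> bool:
    """Treat every numbering format except the bullet marker as ordered."""
    return level_kind != BULLET_LEVEL_KIND
```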

set_current_list_override_number(number)[source]

Set the paragraph-facing list override number (\lsN).

Parameters:

number (int)

Return type:

None

set_destination(destination)[source]

Switch to an RTF destination.

Parameters:

destination (Destination)

Return type:

None

set_image_content_type(content_type)[source]

Set the active image content type.

Parameters:

content_type (str)

Return type:

None

set_image_dimension(field_name, value)[source]

Set an active image dimension or scale field.

Parameters:
  • field_name (str)

  • value (int)

Return type:

None

set_paragraph_style(**changes)[source]

Apply changes to the active paragraph style.

Parameters:

changes (object)

Return type:

None

set_style(**changes)[source]

Apply style changes to the current interned inline style.

Parameters:

changes (object)

Return type:

None
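
Interning means that equal styles share one object, so long runs of identically formatted text do not each allocate a new style. The sketch below shows the idea with a frozen dataclass and a dict cache; StyleSketch and StyleInternerSketch are hypothetical stand-ins, and the real TextStyle and TextStyleInterner may differ in fields and methods.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StyleSketch:
    bold: bool = False
    italic: bool = False


class StyleInternerSketch:
    """Cache that returns one shared instance per distinct style value."""

    def __init__(self):
        self._cache = {}

    def intern(self, style):
        # Frozen dataclasses are hashable, so the style is its own cache key.
        return self._cache.setdefault(style, style)

    def with_changes(self, style, **changes):
        # Build the changed style, then swap it for the cached instance.
        return self.intern(replace(style, **changes))
```

Because the styles are immutable, applying changes never alters a style already attached to earlier text.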

start_annotation()[source]

Start an annotation destination and emit a reference in the main flow.

Return type:

None

start_field()[source]

Start collecting an RTF field.

Return type:

None

start_field_instruction()[source]

Switch to field instruction collection.

Return type:

None

start_field_result()[source]

Switch to field result emission.

Return type:

None

start_footnote()[source]

Start a footnote destination and emit a reference in the main flow.

Return type:

None

start_image()[source]

Start collecting an RTF picture destination.

Return type:

None

start_list_definition()[source]

Start parsing one list definition.

Return type:

None

start_list_level()[source]

Start parsing one list level definition.

Return type:

None

start_list_override()[source]

Start parsing one list override.

Return type:

None

start_list_override_table()[source]

Start parsing a list override table destination.

Return type:

None

start_list_table()[source]

Start parsing an RTF list table destination.

Return type:

None

start_metadata(key)[source]

Start collecting a document metadata field.

Parameters:

key (str)

Return type:

None

start_table_row()[source]

Start collecting a table row.

Return type:

None

Control-word dispatch for the RTF parser.

This module maps lexer tokens to parser-state mutations. The dispatch table is small for Milestone 1 but intentionally explicit so each supported RTF control word can gain focused tests as behavior is ported from the C++ reference.

rtfstruct.control_words.append_unicode_value(state, token)[source]

Append a Unicode control-word value and skip fallback text.

The C++ reference treats negative \u values as signed 16-bit code units. This implementation preserves that early behavior while recording invalid scalar values as diagnostics.

Parameters:
  • state (ParserState)

  • token (Token)

Return type:

None
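
The signed 16-bit interpretation can be sketched as follows. A negative parameter is wrapped into the range 0 to 65535 by adding 65536, and values that are not Unicode scalar values (the surrogate range) are reported as None here in place of a diagnostic. The function name is hypothetical.

```python
from typing import Optional


def unicode_from_rtf_param(param: int) -> Optional[str]:
    """Convert a \\uN parameter to a character, or None if it is not a scalar value."""
    code_unit = param + 0x10000 if param < 0 else param   # signed 16-bit wrap-around
    if 0xD800 <= code_unit <= 0xDFFF or not 0 <= code_unit <= 0x10FFFF:
        return None                                        # surrogate or out of range
    return chr(code_unit)
```

For example, \u-3913 wraps to 0xF0B7, a private-use code point often seen with symbol fonts, while \u-10179 wraps to 0xD83D, a lone high surrogate, and is rejected by this sketch.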

rtfstruct.control_words.handle_control_symbol(state, token)[source]

Handle non-alphabetic RTF control symbols.

Parameters:
  • state (ParserState) – Active parser state.

  • token (Token) – Control-symbol token.

Return type:

None

rtfstruct.control_words.handle_control_word(state, token)[source]

Handle the Milestone 1 subset of RTF control words.

Parameters:
  • state (ParserState) – Active parser state.

  • token (Token) – Control-word token.

Return type:

None

rtfstruct.control_words.handle_hex_char(state, token)[source]

Decode a hex character using the active RTF codepage.

Parameters:
  • state (ParserState) – Active parser state.

  • token (Token) – Hex-character token whose text is two hexadecimal digits.

Return type:

None