Parser
The parser pipeline is split into a lexer, control word dispatch, and explicit parser state. The lexer emits structural tokens, while parser state owns style stacks, destinations, tables, lists, images, metadata, notes, diagnostics, and output buffering.
Use parse_rtf() for bytes or strings and read_rtf() for files.
- rtfstruct.parse_rtf(data, options=None)
Parse RTF data into a structured document AST.
- Parameters:
data (bytes | str) – RTF input as bytes or text. Bytes are decoded as Latin-1 for the skeleton reader so source bytes are preserved one-to-one.
options (ParserOptions | None) – Optional parser configuration.
- Returns:
A Document AST containing parsed blocks and diagnostics.
- Raises:
RtfSyntaxError – Raised when input is not recognisably RTF and recovery is disabled.
- Return type:
Document
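The one-to-one claim for Latin-1 decoding can be checked without the library. The snippet below is a standalone sketch that uses only Python's built-in codecs; it does not call parse_rtf(), and the sample document is made up for illustration.

```python
# Latin-1 maps each byte value 0..255 to the code point with the same number,
# so decoding never fails and encoding back restores the original bytes.
raw = bytes(range(256))
text = raw.decode("latin-1")

assert len(text) == len(raw)
assert all(ord(ch) == b for ch, b in zip(text, raw))
assert text.encode("latin-1") == raw

# A hex escape such as \'e9 is plain ASCII in the source. It survives this
# first decode untouched and can be resolved later against a codepage.
sample = b"{\\rtf1\\ansi caf\\'e9}"
decoded_sample = sample.decode("latin-1")
print(decoded_sample)
```

Because the mapping is lossless, source offsets measured on the decoded string equal byte offsets in the original input.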
- rtfstruct.read_rtf(path, options=None)
Read an RTF file and parse it into a structured document AST.
- Parameters:
path (str | Path) – File path to read.
options (ParserOptions | None) – Optional parser configuration.
- Returns:
Parsed Document AST.
- Return type:
Document
- class rtfstruct.ParserOptions(recover=True, preserve_unknown_destinations=False, extract_images=True, track_spans=False, max_group_depth=1000, max_document_chars=100000000, max_diagnostics=10000)
Parser configuration.
- Parameters:
recover (bool)
preserve_unknown_destinations (bool)
extract_images (bool)
track_spans (bool)
max_group_depth (int)
max_document_chars (int)
max_diagnostics (int)
- recover
Whether recoverable malformed input should produce diagnostics instead of raising.
- Type:
bool
- preserve_unknown_destinations
Whether readable unknown destinations may be preserved later as raw AST payloads.
- Type:
bool
- extract_images
Whether image payload extraction is requested.
- Type:
bool
- track_spans
Whether source spans should be attached where practical.
- Type:
bool
- max_group_depth
Maximum allowed RTF group nesting depth.
- Type:
int
- max_document_chars
Maximum emitted document characters.
- Type:
int
- max_diagnostics
Maximum diagnostics retained on the document.
- Type:
int
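To illustrate what a limit such as max_group_depth guards against, here is a minimal standalone sketch that checks brace nesting over raw RTF text. The function name nesting_within_limit is hypothetical and not part of rtfstruct; how the library enforces its limits internally is not shown here and may differ.

```python
def nesting_within_limit(text: str, max_group_depth: int = 1000) -> bool:
    """Return True if unescaped brace nesting never exceeds the limit."""
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Skip the character after a backslash so \{ and \} are not
            # counted as group delimiters.
            i += 2
            continue
        if ch == "{":
            depth += 1
            if depth > max_group_depth:
                return False
        elif ch == "}":
            # Tolerate stray closing braces instead of going negative.
            depth = max(depth - 1, 0)
        i += 1
    return True


print(nesting_within_limit("{" * 5 + "}" * 5, max_group_depth=5))
print(nesting_within_limit("{" * 6, max_group_depth=5))
```

The loop is iterative, so a hostile input with very deep nesting cannot exhaust the Python call stack while being checked.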
Internal Pipeline
RTF lexer.
This module converts RTF text into a stream of tokens with source offsets. It does not interpret control word semantics or build AST nodes. Semantic handling belongs to reader.py and later control_words.py.
- class rtfstruct.lexer.RtfLexer(text)
Iterative lexer for RTF input.
The lexer avoids recursion and yields group, control, hex, text, and EOF tokens. It intentionally leaves malformed semantic recovery to the parser, while keeping tokenization predictable for large inputs.
- Parameters:
text (str)
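The token categories named above (group, control, hex, text, EOF) can be illustrated with a small iterative tokenizer. This is a simplified sketch, not RtfLexer itself: the Tok type, the lex function, and the token kind strings are hypothetical names, and details such as binary data and newline handling are left out.

```python
import string
from typing import Iterator, NamedTuple

LETTERS = string.ascii_letters
DIGITS = string.digits


class Tok(NamedTuple):
    kind: str   # "group_start", "group_end", "control", "hex", "text", "eof"
    value: str
    start: int  # source offset where the token begins
    end: int    # source offset just past the token


def lex(text: str) -> Iterator[Tok]:
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "{":
            yield Tok("group_start", ch, i, i + 1)
            i += 1
        elif ch == "}":
            yield Tok("group_end", ch, i, i + 1)
            i += 1
        elif ch == "\\":
            start = i
            i += 1
            if text.startswith("'", i):
                # Hex escape: backslash, apostrophe, two hex digits.
                end = min(i + 3, n)
                yield Tok("hex", text[i + 1:end], start, end)
                i = end
            elif i < n and text[i] in LETTERS:
                # Control word: ASCII letters, then an optional signed integer.
                while i < n and text[i] in LETTERS:
                    i += 1
                if i + 1 < n and text[i] == "-" and text[i + 1] in DIGITS:
                    i += 1
                while i < n and text[i] in DIGITS:
                    i += 1
                word_end = i
                if i < n and text[i] == " ":
                    i += 1  # one trailing space is a delimiter, not text
                yield Tok("control", text[start + 1:word_end], start, i)
            else:
                # Control symbol such as \{ or \~ (or a lone final backslash).
                end = min(i + 1, n)
                yield Tok("control", text[i:end], start, end)
                i = end
        else:
            start = i
            while i < n and text[i] not in "{}\\":
                i += 1
            yield Tok("text", text[start:i], start, i)
    yield Tok("eof", "", n, n)


tokens = list(lex("{\\rtf1\\b Hi\\'e9}"))
for tok in tokens:
    print(tok)
```

Note that the sketch only classifies input. It does not decide what \b means or how the hex byte is decoded, which matches the division of labour described for the lexer.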
Mutable parser state for the RTF reader.
This module owns the compact state object used by the parser state machine. It does not tokenize input or decide control-word semantics; lexer tokens come from lexer.py, and semantic dispatch lives in control_words.py.
- class rtfstruct.parser_state.AnnotationContext(id, blocks, parent_output)
Parser-local state for an active annotation destination.
- Parameters:
id (str)
parent_output (OutputContext)
- class rtfstruct.parser_state.FieldContext(result_start_index, instruction_parts=<factory>)
Parser-local state for an active RTF field.
- Parameters:
result_start_index (int)
instruction_parts (list[str])
- property instruction: str
Return the accumulated field instruction.
- class rtfstruct.parser_state.FootnoteContext(id, blocks, parent_output)
Parser-local state for an active footnote destination.
- Parameters:
id (str)
parent_output (OutputContext)
- class rtfstruct.parser_state.ImageContext(id, content_type=None, width_twips=None, height_twips=None, goal_width_twips=None, goal_height_twips=None, scale_x=None, scale_y=None, hex_parts=<factory>)
Parser-local state for an active RTF picture destination.
- Parameters:
id (str)
content_type (str | None)
width_twips (int | None)
height_twips (int | None)
goal_width_twips (int | None)
goal_height_twips (int | None)
scale_x (int | None)
scale_y (int | None)
hex_parts (list[str])
- class rtfstruct.parser_state.OutputContext(current_inlines, text_parts, paragraph_style_for_current, paragraph_span_start, paragraph_span_end, text_span_start, text_span_end, active_blocks, force_text_run_boundary)
Parser output buffers to restore after nested destinations.
- Parameters:
current_inlines (list[TextRun | Link | Field | FootnoteRef | AnnotationRef | ImageInline | LineBreak | Tab])
text_parts (list[str])
paragraph_style_for_current (ParagraphStyle | None)
paragraph_span_start (int | None)
paragraph_span_end (int | None)
text_span_start (int | None)
text_span_end (int | None)
force_text_run_boundary (bool)
- class rtfstruct.parser_state.ParserState(options, diagnostics, style_interner=<factory>, current_style=<factory>, style_stack=<factory>, current_paragraph_style=<factory>, paragraph_style_stack=<factory>, paragraph_style_for_current=None, blocks=<factory>, footnotes=<factory>, footnote_stack=<factory>, next_footnote_number=1, annotations=<factory>, annotation_stack=<factory>, next_annotation_number=1, images=<factory>, image_stack=<factory>, next_image_number=1, metadata=<factory>, current_metadata_key=None, current_metadata_parts=<factory>, table_builder=None, table_parent_output=None, table_row_active=False, list_definitions_ordered=<factory>, list_overrides=<factory>, current_list_id=None, current_list_ordered=False, current_override_list_id=None, current_override_number=None, current_inlines=<factory>, text_parts=<factory>, text_span_start=None, text_span_end=None, paragraph_span_start=None, paragraph_span_end=None, skip_depth=0, current_destination=Destination.NORMAL, destination_stack=<factory>, ansi_codepage=1252, font_table=<factory>, font_charsets=<factory>, color_table=<factory>, current_font_id=None, current_font_charset=None, active_font_charset=None, current_font_name_parts=<factory>, current_color_red=None, current_color_green=None, current_color_blue=None, field_stack=<factory>, force_text_run_boundary=False, unicode_skip_bytes=1, fallback_chars_to_skip=0, emitted_chars=0)
Mutable state for the initial RTF parser state machine.
- Parameters:
options (ParserOptions)
diagnostics (Diagnostics)
style_interner (TextStyleInterner)
current_style (TextStyle)
style_stack (list[TextStyle])
current_paragraph_style (ParagraphStyle)
paragraph_style_stack (list[ParagraphStyle])
paragraph_style_for_current (ParagraphStyle | None)
footnotes (dict[str, Footnote])
footnote_stack (list[FootnoteContext])
next_footnote_number (int)
annotations (dict[str, Annotation])
annotation_stack (list[AnnotationContext])
next_annotation_number (int)
images (dict[str, Image])
image_stack (list[ImageContext])
next_image_number (int)
metadata (Metadata)
current_metadata_key (str | None)
current_metadata_parts (list[str])
table_builder (TableBuilder | None)
table_parent_output (OutputContext | None)
table_row_active (bool)
list_definitions_ordered (dict[int, bool])
list_overrides (dict[int, int])
current_list_id (int | None)
current_list_ordered (bool)
current_override_list_id (int | None)
current_override_number (int | None)
current_inlines (list[TextRun | Link | Field | FootnoteRef | AnnotationRef | ImageInline | LineBreak | Tab])
text_parts (list[str])
text_span_start (int | None)
text_span_end (int | None)
paragraph_span_start (int | None)
paragraph_span_end (int | None)
skip_depth (int)
current_destination (Destination)
destination_stack (list[Destination])
ansi_codepage (int)
font_table (dict[int, str])
font_charsets (dict[int, int])
color_table (list[Color | None])
current_font_id (int | None)
current_font_charset (int | None)
active_font_charset (int | None)
current_font_name_parts (list[str])
current_color_red (int | None)
current_color_green (int | None)
current_color_blue (int | None)
field_stack (list[FieldContext])
force_text_run_boundary (bool)
unicode_skip_bytes (int)
fallback_chars_to_skip (int)
emitted_chars (int)
- options
Parser configuration.
- diagnostics
Capped diagnostic collector.
- style_interner
Shared TextStyle cache.
- current_style
Active inline style.
- Type:
rtfstruct.ast.TextStyle
- style_stack
Group-scoped style stack.
- Type:
list[rtfstruct.ast.TextStyle]
- current_paragraph_style
Active paragraph style.
- paragraph_style_stack
Group-scoped paragraph style stack.
- Type:
list[rtfstruct.ast.ParagraphStyle]
- paragraph_style_for_current
Paragraph style snapshot for active content.
- Type:
rtfstruct.ast.ParagraphStyle | None
- blocks
Emitted top-level blocks (paragraphs, lists, and tables).
- Type:
list[rtfstruct.ast.Paragraph | rtfstruct.ast.ListBlock | rtfstruct.ast.Table]
- active_blocks
Blocks receiving finished paragraphs for the active output.
- Type:
list[rtfstruct.ast.Paragraph | rtfstruct.ast.ListBlock | rtfstruct.ast.Table]
- current_inlines
Inline nodes for the active paragraph.
- text_parts
Buffered plain text for the active run.
- Type:
list[str]
- skip_depth
Current unsupported destination skip depth.
- Type:
int
- current_destination
Active RTF destination.
- Type:
rtfstruct.destinations.Destination
- destination_stack
Group-scoped destination stack.
- Type:
list[rtfstruct.destinations.Destination]
- font_table
Parsed font table keyed by RTF font number.
- Type:
dict[int, str]
- color_table
Parsed color table indexed by RTF color number.
- Type:
list[rtfstruct.ast.Color | None]
- field_stack
Active field contexts.
- Type:
list[rtfstruct.parser_state.FieldContext]
- unicode_skip_bytes
Current \ucN fallback length.
- Type:
int
- fallback_chars_to_skip
Remaining fallback characters after \uN.
- Type:
int
- emitted_chars
Count of emitted document characters for safety limits.
- Type:
int
- add_image_hex_payload(hex_text)
Append image payload hex text.
- Parameters:
hex_text (str)
- Return type:
None
- add_inline(inline)
Flush pending text and append a non-text inline node.
- Parameters:
inline (TextRun | Link | Field | FootnoteRef | AnnotationRef | ImageInline | LineBreak | Tab)
- Return type:
None
- add_table_cell_boundary(boundary_twips)
Record a table cell right-edge boundary.
- Parameters:
boundary_twips (int)
- Return type:
None
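Right-edge boundaries are cumulative positions, so individual cell widths fall out as differences between neighbouring boundaries. The sketch below shows that arithmetic on its own; cell_widths_twips is a hypothetical helper, not a method of ParserState, and it assumes boundaries arrive in left-to-right order.

```python
TWIPS_PER_INCH = 1440  # a twip is 1/20 of a point, and a point is 1/72 inch


def cell_widths_twips(boundaries, left_edge=0):
    """Convert cumulative right-edge boundaries into per-cell widths."""
    widths = []
    previous = left_edge
    for right in boundaries:
        widths.append(right - previous)
        previous = right
    return widths


# Two cells whose right edges sit 1 inch and 3 inches from the left edge.
widths = cell_widths_twips([1440, 4320])
print(widths)
print([w / TWIPS_PER_INCH for w in widths])
```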
- add_text(text, start=None, end=None)
Append text after applying Unicode fallback skipping.
- Parameters:
text (str) – Text to append to the active text buffer.
start (int | None) – Optional source start offset.
end (int | None) – Optional source end offset.
- Return type:
None
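Fallback skipping exists because an RTF writer follows each \uN with a number of substitute characters (set by \ucN) for readers that do not understand Unicode. A minimal sketch of the bookkeeping follows; the FallbackSkipper class and its add_unicode method are hypothetical stand-ins, source offsets are omitted, and the real add_text may handle more cases.

```python
class FallbackSkipper:
    def __init__(self):
        self.text_parts = []
        self.unicode_skip = 1            # fallback length, as set by \ucN
        self.fallback_chars_to_skip = 0  # fallback still to be dropped

    def add_unicode(self, ch):
        """Record a character from \\uN and arm the fallback skip."""
        self.text_parts.append(ch)
        self.fallback_chars_to_skip = self.unicode_skip

    def add_text(self, text):
        """Append text, dropping any pending fallback characters first."""
        if self.fallback_chars_to_skip:
            skipped = min(self.fallback_chars_to_skip, len(text))
            text = text[skipped:]
            self.fallback_chars_to_skip -= skipped
        if text:
            self.text_parts.append(text)


# Source fragment "\u8364? price": the "?" is the fallback for the euro sign.
s = FallbackSkipper()
s.add_unicode(chr(8364))
s.add_text("? price")
print("".join(s.text_parts))
```

Keeping the remaining count on the state object lets the skip carry over when the fallback characters are split across several text tokens.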
- apply_background_color(index, control_word)
Apply a color-table entry as the current background/highlight color.
- Parameters:
index (int)
control_word (str)
- Return type:
None
- apply_foreground_color(index)
Apply a color-table entry as the current foreground color.
- Parameters:
index (int)
- Return type:
None
- decode_hex_char(hex_text, *, offset)
Decode one hex escape using the active codepage.
- Parameters:
hex_text (str)
offset (int)
- Return type:
str | None
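The str | None return type suggests that a hex escape may fail to decode. The standalone function below sketches one way that can work with Python's codecs. It is an illustration under assumptions: it takes the codepage as an argument instead of reading parser state, ignores the offset parameter, and handles single-byte codepages only.

```python
from typing import Optional


def decode_hex_escape(hex_text: str, codepage: int = 1252) -> Optional[str]:
    """Decode the two hex digits of a \\'hh escape, or return None."""
    try:
        raw = bytes.fromhex(hex_text)
        if len(raw) != 1:
            return None
        return raw.decode(f"cp{codepage}")
    except (ValueError, LookupError):
        # ValueError covers bad hex digits and undecodable bytes
        # (UnicodeDecodeError is a subclass); LookupError covers a
        # codepage Python has no codec for.
        return None


print(decode_hex_escape("e9"))        # Western European codepage
print(decode_hex_escape("e9", 1251))  # Cyrillic codepage, same byte
print(decode_hex_escape("81"))        # byte with no mapping in cp1252
```

The same byte yields different characters under different codepages, which is why the document's ANSI codepage has to be known before hex escapes are resolved.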
- finalize_destination()
Finalize destination-local buffered state before leaving a group.
- Return type:
None
- finalize_open_contexts()
Recover open destinations at EOF and restore output targets.
- Return type:
None
- finalize_pending_table()
Emit a pending table after its rows have been collected.
- Return type:
None
- finish_table_row()
Finish the active table row and restore parent output buffers.
- Return type:
None
- list_ordering_by_override()
Return ordered/unordered status keyed by paragraph list identity.
- Return type:
dict[int, bool]
- mark_table_horizontal_merge_continuation()
Mark the next table cell as horizontally merged with the left cell.
- Return type:
None
- mark_table_horizontal_merge_start()
Mark the next table cell as a horizontal merge anchor.
- Return type:
None
- mark_table_vertical_merge_continuation()
Mark the next table cell as vertically merged with the cell above.
- Return type:
None
- mark_table_vertical_merge_start()
Mark the next table cell as a vertical merge anchor.
- Return type:
None
- pop_group(token)
Pop group-scoped state.
- Parameters:
token (Token) – Group-end token that caused the pop.
- Return type:
None
- push_group(token)
Push group-scoped state.
- Parameters:
token (Token) – Group-start token that caused the push.
- Return type:
None
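Group scoping means that formatting set inside braces ends when the group closes. The sketch below shows the push/pop pattern with a plain stack of immutable style values. Style and GroupState are hypothetical stand-ins for the library's TextStyle and ParserState, and the real methods also take the triggering token and track more than inline style.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Style:
    bold: bool = False
    italic: bool = False


class GroupState:
    def __init__(self):
        self.current = Style()
        self.stack = []

    def push_group(self):
        # Entering "{": remember the style in force outside the group.
        self.stack.append(self.current)

    def pop_group(self):
        # Leaving "}": restore the outer style. An unbalanced "}" is
        # tolerated here by leaving the current style unchanged.
        if self.stack:
            self.current = self.stack.pop()

    def set_style(self, **changes):
        # Styles are immutable, so a change produces a new value.
        self.current = replace(self.current, **changes)


# Walk through: {\b bold {\i both} bold again} plain
s = GroupState()
s.push_group()
s.set_style(bold=True)
s.push_group()
s.set_style(italic=True)
inner = s.current
s.pop_group()
outer = s.current
s.pop_group()
print(inner, outer, s.current)
```

Because the style values are frozen, pushing a reference is enough; no copy is needed and a later set_style cannot alter what is on the stack.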
- reset_paragraph_style()
Reset active paragraph formatting for subsequent text.
- Return type:
None
- set_ansi_codepage(codepage)
Set the document ANSI codepage used for hex decoding.
- Parameters:
codepage (int)
- Return type:
None
- set_color_component(component, value)
Set a component of the current color-table entry.
- Parameters:
component (str)
value (int)
- Return type:
None
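Color-table entries are built up one component at a time from \redN, \greenN, and \blueN, with each entry terminated by a semicolon. The sketch below parses a whole table body in one pass to show the resulting shape. parse_color_table is a hypothetical helper that works on raw text with a regular expression, unlike the token-driven method above, and defaulting a missing component to 0 is an assumption.

```python
import re


def parse_color_table(body: str):
    """Parse the text that follows \\colortbl into a list of entries.

    An empty entry (typically the first) stands for the automatic color
    and is represented as None; other entries become (r, g, b) tuples.
    """
    colors = []
    # Entries are ';'-terminated, so the piece after the last ';' is dropped.
    for entry in body.split(";")[:-1]:
        parts = dict(re.findall(r"\\(red|green|blue)(\d+)", entry))
        if not parts:
            colors.append(None)
        else:
            colors.append(
                tuple(int(parts.get(k, 0)) for k in ("red", "green", "blue"))
            )
    return colors


table = parse_color_table(";\\red255\\green0\\blue0;\\red0\\green128\\blue255;")
print(table)
```

Keeping the None placeholder preserves index positions, so a color reference such as index 1 still points at the first explicit color.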
- set_current_font_charset(charset)
Set the charset for the active font-table entry.
- Parameters:
charset (int)
- Return type:
None
- set_current_font_id(font_id)
Set the active font number for table parsing or text styling.
- Parameters:
font_id (int)
- Return type:
None
- set_current_list_id(list_id)
Set the active list-definition identifier.
- Parameters:
list_id (int)
- Return type:
None
- set_current_list_level_kind(level_kind)
Record whether a list level is ordered.
RTF \levelnfc23 is the common bullet marker. Numbering formats such as decimal, roman, and alphabetic are treated as ordered for this initial pass.
- Parameters:
level_kind (int)
- Return type:
None
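The ordered/unordered decision can be sketched as a single comparison against the bullet format. The helper below is hypothetical, and treating every value other than 23 as ordered is an assumption made for illustration; the library may classify less common formats differently.

```python
# \levelnfc value for bullets, per the RTF specification.
LEVELNFC_BULLET = 23


def is_ordered_level(level_kind: int) -> bool:
    # Assumption: anything that is not the bullet format counts as ordered.
    return level_kind != LEVELNFC_BULLET


# 0 = decimal, 1 = uppercase roman, 4 = lowercase letter, 23 = bullet
for nfc, label in [(0, "decimal"), (1, "upper roman"),
                   (4, "lower letter"), (23, "bullet")]:
    print(label, is_ordered_level(nfc))
```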
- set_current_list_override_number(number)
Set the paragraph-facing list override number (\lsN).
- Parameters:
number (int)
- Return type:
None
- set_destination(destination)
Switch to an RTF destination.
- Parameters:
destination (Destination)
- Return type:
None
- set_image_content_type(content_type)
Set the active image content type.
- Parameters:
content_type (str)
- Return type:
None
- set_image_dimension(field_name, value)
Set an active image dimension or scale field.
- Parameters:
field_name (str)
value (int)
- Return type:
None
- set_paragraph_style(**changes)
Apply changes to the active paragraph style.
- Parameters:
changes (object)
- Return type:
None
- set_style(**changes)
Apply style changes to the current interned inline style.
- Parameters:
changes (object)
- Return type:
None
- start_annotation()
Start an annotation destination and emit a reference in the main flow.
- Return type:
None
- start_footnote()
Start a footnote destination and emit a reference in the main flow.
- Return type:
None
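Starting a note involves two things at once: a reference lands in the main flow, and later content is redirected into the note until it closes. The sketch below shows that redirection with plain lists. The Flow class, its method names, and the "fn1"-style identifiers are all hypothetical; the library's FootnoteContext and OutputContext carry considerably more state.

```python
class Flow:
    def __init__(self):
        self.main = []
        self.footnotes = {}
        self.active = self.main  # list currently receiving content
        self._parents = []       # outputs to restore when notes close
        self._next_number = 1

    def add(self, item):
        self.active.append(item)

    def start_footnote(self):
        note_id = f"fn{self._next_number}"  # hypothetical id format
        self._next_number += 1
        # The reference goes where the note was encountered.
        self.active.append(("footnote_ref", note_id))
        # Remember the current output, then redirect into the note body.
        self._parents.append(self.active)
        self.footnotes[note_id] = []
        self.active = self.footnotes[note_id]
        return note_id

    def end_footnote(self):
        self.active = self._parents.pop()


f = Flow()
f.add("Body")
nid = f.start_footnote()
f.add("Note text")
f.end_footnote()
f.add("more body")
print(f.main)
print(f.footnotes)
```

Using a stack for the parent outputs means a note opened inside another destination restores to the correct place when it ends.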
- start_list_override_table()
Start parsing a list override table destination.
- Return type:
None
Control-word dispatch for the RTF parser.
This module maps lexer tokens to parser-state mutations. The dispatch table is small for Milestone 1 but intentionally explicit so each supported RTF control word can gain focused tests as behavior is ported from the C++ reference.
- rtfstruct.control_words.append_unicode_value(state, token)
Append a Unicode control-word value and skip fallback text.
The C++ reference treats negative \u values as signed 16-bit code units. This implementation preserves that early behavior while recording invalid scalar values as diagnostics.
- Parameters:
state (ParserState)
token (Token)
- Return type:
None
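The signed 16-bit convention arises because \uN takes a signed 16-bit parameter, so code units above 32767 are written as negative numbers. The sketch below shows the conversion and a validity check. unicode_value_to_char is a hypothetical helper that returns a (character, problem) pair instead of mutating parser state, and it treats a lone surrogate as invalid without attempting to combine surrogate pairs.

```python
def unicode_value_to_char(value: int):
    """Map a \\uN parameter to (char, None) or (None, problem message)."""
    # Negative parameters wrap around as 16-bit code units.
    code = value & 0xFFFF if value < 0 else value
    if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        return None, f"invalid Unicode scalar value {code:#x}"
    return chr(code), None


print(unicode_value_to_char(8364))    # euro sign, written as a positive value
print(unicode_value_to_char(-3913))   # 0xF0B7, a private-use bullet glyph
print(unicode_value_to_char(-10179))  # 0xD83D, a lone high surrogate
```

Returning the problem as data mirrors the idea of recording a diagnostic and continuing, as opposed to raising on the first bad value.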
- rtfstruct.control_words.handle_control_symbol(state, token)
Handle non-alphabetic RTF control symbols.
- Parameters:
state (ParserState) – Active parser state.
token (Token) – Control-symbol token.
- Return type:
None
- rtfstruct.control_words.handle_control_word(state, token)
Handle the Milestone 1 subset of RTF control words.
- Parameters:
state (ParserState) – Active parser state.
token (Token) – Control-word token.
- Return type:
None
- rtfstruct.control_words.handle_hex_char(state, token)
Decode a hex character using the active RTF codepage.
- Parameters:
state (ParserState) – Active parser state.
token (Token) – Hex-character token whose text is two hexadecimal digits.
- Return type:
None