elm.web.document.MDDocument

class MDDocument(pages, attrs=None, remove_comments=True, text_splitter=None)

ELM Markdown document

Parameters:

pages (iterable) – Iterable of strings, where each string is a page of a document.
attrs (dict, optional) – Optional dict containing metadata for the document. By default, None.
remove_comments (bool, optional) – Option to remove HTML comments from the Markdown text during cleaning. By default, True.
text_splitter (obj, optional) – Instance of an object that implements a split_text method. The method should take text as input (str) and return a list of text chunks. The input pages are passed through this splitter to create the raw pages for this document. Langchain’s text splitters should work for this input. By default, None, in which case the original pages input becomes the raw pages attribute.
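
The text_splitter contract described above (a split_text method that takes a str and returns a list of chunks) can be sketched as follows. ParagraphSplitter is a hypothetical class written for illustration and is not part of ELM. The per-page chunking loop only demonstrates the splitter on its own and is not a claim about how MDDocument applies it internally.

```python
class ParagraphSplitter:
    """Hypothetical splitter (not part of ELM): breaks text on blank lines."""

    def split_text(self, text):
        # Take a str and return a list of non-empty str chunks
        return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


pages = ["# Title\n\nFirst paragraph.\n\n<!-- draft note -->", "Second page."]
splitter = ParagraphSplitter()

# Illustration only: run each page through the splitter and flatten the result
chunks = [chunk for page in pages for chunk in splitter.split_text(page)]

# With ELM installed, the same objects would be passed to the constructor
# using the signature documented above:
# from elm.web.document import MDDocument
# doc = MDDocument(pages, attrs={"source": "example.md"}, text_splitter=splitter)
```

Any object with a compatible split_text method should fit the same slot, which is why Langchain's splitters are expected to work here.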

Attributes

`FILE_EXTENSION`
`MARKDOWN_COMMENT_RE`	Regex pattern to remove HTML comments from markdown text
`WRITE_KWARGS`
`empty`	`True` if the document contains no pages.
`raw_pages`	List of (a limited count of) raw pages
`text`	Cleaned text from document

MARKDOWN_COMMENT_RE = re.compile('', re.DOTALL)

Regex pattern to remove HTML comments from markdown text
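
As a rough illustration of how a compiled pattern with re.DOTALL can strip HTML comments from Markdown, consider the sketch below. The pattern string is an assumption chosen for this example and may differ from the one ELM actually uses; COMMENT_RE, raw, and cleaned are illustrative names.

```python
import re

# Assumed pattern (not confirmed by the ELM source): non-greedy match from
# "<!--" to the nearest "-->". re.DOTALL lets "." match newlines, so
# comments that span several lines are matched as well.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

raw = "Intro text.\n<!-- a comment\nspanning two lines -->\nBody text."

# Replace every matched comment with an empty string
cleaned = COMMENT_RE.sub("", raw)
```

Without re.DOTALL, the multi-line comment in this example would not match, because "." stops at newline characters by default.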

property raw_pages

List of (a limited count of) raw pages

property text

Cleaned text from document

property empty

True if the document contains no pages.