The release of Reader-LM-0.5B and Reader-LM-1.5B by Jina AI represents a significant advancement in small language model (SLM) technology. These models are designed specifically to convert raw, noisy HTML from the open web into clean markdown, a task complicated by the noise that pervades modern web content, such as headers, footers, and sidebars. The Reader-LM series aims to handle this conversion efficiently while prioritizing cost-effectiveness and performance.
Background and Purpose
Jina AI launched Jina Reader in April 2024, an API that transforms any URL into markdown suitable for large language models (LLMs). The API relies on Mozilla's Readability package to extract the main content of a webpage, followed by regex cleanup and the Turndown library to convert the cleaned HTML into markdown. However, this approach struggled with inaccurate content filtering and with converting intricate HTML structures. Based on user feedback, Jina AI recognized that patching the existing pipeline with ever more regex patterns was not a sustainable solution.
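For illustration, that classic pipeline can be sketched in a few lines of Python. Readability and Turndown are JavaScript libraries, so the snippet below substitutes the readability-lxml and markdownify packages as rough stand-ins; it is a minimal sketch of the approach, not Jina Reader's actual implementation.

```python
import re

from markdownify import markdownify as md  # stand-in for the Turndown JS library
from readability import Document           # readability-lxml, a port of Mozilla's Readability

def html_to_markdown(raw_html: str) -> str:
    # Step 1: extract the main article content, dropping headers, footers, sidebars.
    main_html = Document(raw_html).summary()
    # Step 2: regex cleanup for leftovers the extractor missed -- brittle by nature,
    # since every new edge case needs yet another pattern.
    main_html = re.sub(r"<!--.*?-->", "", main_html, flags=re.DOTALL)
    # Step 3: convert the cleaned HTML into markdown.
    return md(main_html, heading_style="ATX")

raw = "<html><body><nav>Menu</nav><article><h1>Title</h1><p>Body text.</p></article></body></html>"
print(html_to_markdown(raw))
```

Step 2 is where the maintenance burden accumulates: each new kind of noise on the open web demands another hand-written pattern.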
To overcome these limitations, Jina AI posed a critical question: can this problem be solved end-to-end with a language model? Instead of depending on manually crafted rules, could a language model, especially one with fewer than one billion parameters, handle HTML-to-markdown conversion more effectively?
Introduction of Reader-LM Models
Jina AI introduced two small language models, Reader-LM-0.5B and Reader-LM-1.5B, trained specifically to convert raw HTML into markdown. Both are multilingual and support a context length of up to 256K tokens. This long-context capacity is crucial because the HTML of a modern website is laden with noise from elements such as inline CSS and JavaScript.
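As a sketch of how such a model can be invoked via Hugging Face transformers (the repository id below is an assumption; check Jina AI's model pages for the exact name), the raw HTML is passed as the user message and the generated response is the markdown:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "jinaai/reader-lm-1.5b"  # assumed repo id; a 0.5b variant also exists
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

# The raw HTML is the entire user message; no extra instructions are needed.
messages = [{"role": "user", "content": html}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens, i.e. the markdown output.
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```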
Despite their small size (0.5B and 1.5B parameters), both Reader-LMs outperform many larger LLMs on the specific task of HTML-to-markdown conversion.
Architecture and Specifications
The architecture focuses primarily on selective copying from HTML to markdown rather than on typical LLM functions such as open-ended text generation or code writing.
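A hypothetical input/output pair makes "selective copying" concrete: nearly every output token already appears in the input, and the model's job is to decide what to keep, what to drop, and where to insert markdown syntax.

```python
# Hypothetical example pair: the target markdown is almost a subsequence of the
# HTML input -- the model copies content tokens and drops markup and noise.
html_in = (
    "<html><body>"
    "<nav>Home | About</nav>"
    "<article><h2>Release notes</h2><p>Bug fixes and speedups.</p></article>"
    "<footer>(c) 2024</footer>"
    "</body></html>"
)
markdown_out = "## Release notes\n\nBug fixes and speedups.\n"
# Everything in markdown_out except the "##" marker is copied verbatim from
# html_in; the nav and footer noise is dropped entirely.
```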
Model Specifications:
- Reader-LM-0.5B: This model contains 494 million parameters, with 24 layers, a hidden size of 896, and 14 query heads.
- Reader-LM-1.5B: This larger model comprises approximately 1.54 billion parameters, with 28 layers, a hidden size of 1536, and 12 query heads.
Both models support a context length of up to 256K tokens, which is essential for handling the long, noisy HTML of real-world websites across many languages.
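These figures can also be checked programmatically. The sketch below assumes the models are published under the repository ids shown and reads the specifications from each model's configuration:

```python
from transformers import AutoConfig

# Repository ids are assumptions; verify on Jina AI's Hugging Face page.
for repo in ("jinaai/reader-lm-0.5b", "jinaai/reader-lm-1.5b"):
    cfg = AutoConfig.from_pretrained(repo)
    print(f"{repo}: {cfg.num_hidden_layers} layers, "
          f"hidden size {cfg.hidden_size}, "
          f"{cfg.num_attention_heads} query heads, "
          f"max context {cfg.max_position_embeddings} tokens")
```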