Class PDFText2Markdown

java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.apache.pdfbox.tools.PDFText2Markdown

public class PDFText2Markdown extends org.apache.pdfbox.text.PDFTextStripper
Converts PDF text to Markdown. Each line of text in the PDF becomes a corresponding Markdown paragraph, and bold and italic formatting is applied based on font properties.
Author:
Saurav Rawat
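
The description above says bold and italic formatting is derived from font properties, but this page does not state which properties are consulted. As a rough illustration only, the sketch below assumes a font-name heuristic; the class `FontStyleSketch` and its methods are hypothetical and are not part of PDFBox.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: guesses the style from a font name.
// Assumption: names such as "Helvetica-Bold" or "Times-Italic" carry the style.
class FontStyleSketch {

    /** Returns true if the font name suggests a bold face. */
    static boolean looksBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Returns true if the font name suggests an italic or oblique face. */
    static boolean looksItalic(String fontName) {
        if (fontName == null) {
            return false;
        }
        String name = fontName.toLowerCase(Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }

    public static void main(String[] args) {
        System.out.println(looksBold("Helvetica-Bold"));      // prints "true"
        System.out.println(looksItalic("Helvetica-Oblique")); // prints "true"
        System.out.println(looksBold("Times-Roman"));         // prints "false"
    }
}
```

A name-based check is only one possible signal; a real converter might also consult font descriptor flags or font weight.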
  • Field Summary

    Fields inherited from class org.apache.pdfbox.text.PDFTextStripper

    charactersByArticle, document, LINE_SEPARATOR, output
  • Constructor Summary

    Constructors
    Constructor
    Description
    PDFText2Markdown()
    Constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    protected float
    computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
     
    protected void
    endArticle()
    Write out the article separator.
    protected void
    showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4)
     
    protected void
    startArticle(boolean isLTR)
    Write out the article separator with proper text direction information.
    protected void
    writeParagraphEnd()
    Writes the Markdown paragraph end to the output.
    protected void
    writeString(String chars)
    Write a string to the output stream and escape some Markdown characters.
    protected void
    writeString(String text, List<org.apache.pdfbox.text.TextPosition> textPositions)
    Write a string to the output stream, maintain font state, and escape some Markdown characters.

    Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

    beginMarkedContentSequence, endDocument, endMarkedContentSequence, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIgnoreContentStreamSpaceGlyphs, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIgnoreContentStreamSpaceGlyphs, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeParagraphStart, writeText, writeWordSeparator

    Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

    addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, isShouldProcessColorOperators, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator

    Methods inherited from class Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • PDFText2Markdown

      public PDFText2Markdown() throws IOException
      Constructor.
      Throws:
      IOException - If there is an error during initialization.
  • Method Details

    • startArticle

      protected void startArticle(boolean isLTR) throws IOException
      Write out the article separator with proper text direction information.
      Overrides:
      startArticle in class org.apache.pdfbox.text.PDFTextStripper
      Parameters:
      isLTR - true if direction of text is left to right
      Throws:
      IOException - If there is an error writing to the stream.
    • endArticle

      protected void endArticle() throws IOException
      Write out the article separator.
      Overrides:
      endArticle in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException - If there is an error writing to the stream.
    • writeString

      protected void writeString(String text, List<org.apache.pdfbox.text.TextPosition> textPositions) throws IOException
      Write a string to the output stream, maintain font state, and escape some Markdown characters. The font state is only preserved per word.
      Overrides:
      writeString in class org.apache.pdfbox.text.PDFTextStripper
      Parameters:
      text - The text to write to the stream.
      textPositions - The corresponding text positions.
      Throws:
      IOException - If there is an error writing to the stream.
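
To make the statement "the font state is only preserved per word" more concrete, the sketch below wraps each word in its own Markdown emphasis markers instead of carrying a style across word boundaries. This is one plausible reading, not the actual implementation; `PerWordStyleSketch` and `StyledWord` are hypothetical names, and Markdown escaping is left out to keep the example short.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of PDFBox: every word opens and closes
// its own emphasis markers, so no style leaks into the next word.
class PerWordStyleSketch {

    /** Hypothetical stand-in for a word plus the style of its glyphs. */
    static final class StyledWord {
        final String text;
        final boolean bold;
        final boolean italic;

        StyledWord(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /** Joins the words with single spaces, wrapping each one individually. */
    static String render(List<StyledWord> words) {
        StringBuilder out = new StringBuilder();
        for (StyledWord word : words) {
            if (out.length() > 0) {
                out.append(' ');
            }
            // "**" for bold, "*" for italic, "***" when both apply.
            String marker = (word.bold ? "**" : "") + (word.italic ? "*" : "");
            out.append(marker).append(word.text).append(marker);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<StyledWord> words = Arrays.asList(
                new StyledWord("Hello", true, false),
                new StyledWord("brave", false, true),
                new StyledWord("world", false, false));
        System.out.println(render(words)); // prints "**Hello** *brave* world"
    }
}
```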
    • writeString

      protected void writeString(String chars) throws IOException
      Write a string to the output stream and escape some Markdown characters.
      Overrides:
      writeString in class org.apache.pdfbox.text.PDFTextStripper
      Parameters:
      chars - String to be written to the stream.
      Throws:
      IOException - If there is an error writing to the stream.
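
This page does not list which Markdown characters are escaped. The sketch below shows the general technique of backslash escaping, assuming a small set of special characters chosen purely for illustration; `MarkdownEscapeSketch` and its `SPECIAL` set are hypothetical and may differ from what the class actually escapes.

```java
// Hypothetical sketch, not part of PDFBox: prefixes selected Markdown
// metacharacters with a backslash so they render literally.
class MarkdownEscapeSketch {

    // Assumed character set: backslash, backtick, '*', '_', '[', ']' and '#'.
    private static final String SPECIAL = "\\`*_[]#";

    /** Returns a copy of the input with every special character escaped. */
    static String escape(String chars) {
        StringBuilder sb = new StringBuilder(chars.length() + 8);
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("2 * 3 = 6, see [1]")); // prints "2 \* 3 = 6, see \[1\]"
    }
}
```

Escaping character by character keeps the method independent of any font or position information, which matches the single-argument `writeString(String chars)` signature.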
    • writeParagraphEnd

      protected void writeParagraphEnd() throws IOException
      Writes the Markdown paragraph end to the output and clears the font state.

      Overrides:
      writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • showGlyph

      protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) throws IOException
      Overrides:
      showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
      Throws:
      IOException
    • computeFontHeight

      protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
      Throws:
      IOException