Class PDFText2Markdown
java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.apache.pdfbox.tools.PDFText2Markdown
public class PDFText2Markdown
extends org.apache.pdfbox.text.PDFTextStripper
Convert PDF text to Markdown format. Each line in the PDF is converted to a corresponding
Markdown paragraph. Bold and italic formatting is also applied based on font properties.
Author:
Saurav Rawat
Field Summary
Fields inherited from class org.apache.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output
Constructor Summary
Constructors
Method Summary
Modifier and Type | Method | Description
protected float | computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) |
protected void | endArticle() | Write out the article separator.
protected void | showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) |
protected void | startArticle(boolean isLTR) | Write out the article separator with proper text direction information.
protected void | writeParagraphEnd() | Writes the Markdown paragraph end to the output.
protected void | writeString(String chars) | Write a string to the output stream and escape some Markdown characters.
protected void | writeString(String text, List<org.apache.pdfbox.text.TextPosition> textPositions) | Write a string to the output stream, maintain font state, and escape some Markdown characters.
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
beginMarkedContentSequence, endDocument, endMarkedContentSequence, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIgnoreContentStreamSpaceGlyphs, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIgnoreContentStreamSpaceGlyphs, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeParagraphStart, writeText, writeWordSeparator
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, isShouldProcessColorOperators, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
Constructor Details
PDFText2Markdown
Constructor.
Throws:
IOException - If there is an error during initialization.

Method Details
startArticle
protected void startArticle(boolean isLTR) throws IOException
Write out the article separator with proper text direction information.
Overrides:
startArticle in class org.apache.pdfbox.text.PDFTextStripper
Parameters:
isLTR - true if direction of text is left to right
Throws:
IOException - If there is an error writing to the stream.

endArticle
protected void endArticle() throws IOException
Write out the article separator.
Overrides:
endArticle in class org.apache.pdfbox.text.PDFTextStripper
Throws:
IOException - If there is an error writing to the stream.

writeString
protected void writeString(String text, List<org.apache.pdfbox.text.TextPosition> textPositions) throws IOException
Write a string to the output stream, maintain font state, and escape some Markdown characters. The font state is only preserved per word.
Overrides:
writeString in class org.apache.pdfbox.text.PDFTextStripper
Parameters:
text - The text to write to the stream.
textPositions - The corresponding text positions.
Throws:
IOException - If there is an error writing to the stream.

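The idea of carrying bold and italic state from word to word and resetting it at the end of a paragraph can be illustrated with a small standalone sketch. This is not the PDFBox implementation: the FontStateSketch class, its Word type, and the choice of ** and * as markers are assumptions made for illustration, and the sketch takes ready-made bold/italic flags instead of reading them from TextPosition fonts.

```java
import java.util.List;

/** Hypothetical illustration only; not part of PDFBox. */
final class FontStateSketch {

    /** Assumed stand-in for one word plus the font properties of its glyphs. */
    static final class Word {
        final String text;
        final boolean bold;
        final boolean italic;

        Word(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    private final StringBuilder out = new StringBuilder();
    private boolean bold;
    private boolean italic;

    /** Emits one word, closing and reopening markers only when the font state changes. */
    void writeWord(Word word) {
        boolean changed = word.bold != bold || word.italic != italic;
        if (changed) {
            closeMarkers(); // close before the separating space
        }
        if (out.length() > 0) {
            out.append(' ');
        }
        if (changed) {
            bold = word.bold;
            italic = word.italic;
            openMarkers(); // open after the separating space
        }
        out.append(word.text);
    }

    /** Closes any open markers and clears the font state, as a paragraph end would. */
    void endParagraph() {
        closeMarkers();
        bold = false;
        italic = false;
    }

    private void openMarkers() {
        if (bold) out.append("**");
        if (italic) out.append('*');
    }

    private void closeMarkers() {
        // Reverse order of openMarkers so the markers nest correctly.
        if (italic) out.append('*');
        if (bold) out.append("**");
    }

    /** Renders one paragraph from a list of words. */
    static String toMarkdown(List<Word> words) {
        FontStateSketch sketch = new FontStateSketch();
        for (Word w : words) {
            sketch.writeWord(w);
        }
        sketch.endParagraph();
        return sketch.out.toString();
    }

    public static void main(String[] args) {
        List<Word> words = List.of(
                new Word("Hello", true, false),
                new Word("big", true, false),
                new Word("world", false, false),
                new Word("now", false, true));
        System.out.println(toMarkdown(words)); // prints: **Hello big** world *now*
    }
}
```

The sketch does not escape the word text; a real converter would need to do that as well so that literal asterisks are not read as emphasis.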
writeString
protected void writeString(String chars) throws IOException
Write a string to the output stream and escape some Markdown characters.
Overrides:
writeString in class org.apache.pdfbox.text.PDFTextStripper
Parameters:
chars - String to be written to the stream.
Throws:
IOException - If there is an error writing to the stream.

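The documentation does not say which characters are escaped. As a rough sketch of what escaping Markdown characters can look like, the following standalone helper prefixes a small, assumed set of special characters with a backslash. The class name MarkdownEscapeSketch and the character set are hypothetical and may differ from what PDFText2Markdown actually does.

```java
/** Hypothetical illustration only; not part of PDFBox. */
final class MarkdownEscapeSketch {

    // Assumed set of characters to escape: backslash, backtick, *, _, #, [ and ].
    private static final String SPECIAL = "\\`*_#[]";

    /** Returns the input with each assumed special character preceded by a backslash. */
    static String escape(String chars) {
        StringBuilder sb = new StringBuilder(chars.length() + 8);
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("2 * 3 = 6 and snake_case")); // prints: 2 \* 3 = 6 and snake\_case
    }
}
```

Escaping before emphasis markers are added keeps literal asterisks and underscores in the PDF text from being interpreted as formatting.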
writeParagraphEnd
protected void writeParagraphEnd() throws IOException
Writes the Markdown paragraph end to the output and clears the font state.
Overrides:
writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
Throws:
IOException

showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) throws IOException
Overrides:
showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
Throws:
IOException

computeFontHeight
protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
Throws:
IOException