PDF/HTML into EPUB

Introduction

Some things I learned while trying to convert PDFs into EPUBs for use on an e-reader.

PDF is possibly the worst format from which to get an EPUB. At best, the output is likely to still show a few oddities; at worst, some parts will simply be unreadable. This is especially true for more complex layouts that include multi-column text, tables, or insets. The more sophisticated the layout, the worse the output.

Before even trying to convert a PDF to EPUB, check whether your e-reader handles PDFs well enough, especially if it has a larger screen, although smaller screens can sometimes "reflow" PDFs to fit (check the options). Regardless of its format (EPUB or PDF), a complex document, eg. with multiple columns, will never work well on a small screen; for those, a wider e-reader is the realistic solution. If the stock e-reader software isn't up to the task, you can always try to massage the PDF with k2pdfopt and see if it works well enough, or even see if the KOReader application can be installed on your e-reader.

The AZW/AZW3/KFX formats used on Kindles are Amazon's own formats (AZW derives from MOBI/PRC), and the files are usually DRM-protected. Use Calibre to turn DRM-free files into EPUB.

As for turning web pages (HTML) into EPUB, use pandoc.

How to proceed

  1. Ideally, get an EPUB file
  2. If you're stuck with PDF and a bigger e-reader is not an option (although it's the only realistic way to read PDFs with anything beyond a single-column, no-frills layout), open the PDF in your e-reader, playing if necessary with its "Reflow Text" and/or "Crop Margins" options to remove useless space around pages
  3. If an application can be installed, try alternative/complementary software such as KOReader or Duokan
  4. If it still doesn't look good enough to read, run it through k2pdfopt, which will massage it and create a new PDF for use on smaller e-readers
  5. Alternatively, if only some pages are garbage, investigate how to extract just those, convert them into PNG and then back into PDF, and merge everything into a hybrid, text + picture PDF file; this is especially important for tables
  6. If it's too painful to read, try to convert it into an EPUB with Calibre, which relies on pdftohtml from the Poppler library to import text. It might be useful to start by hard-cropping the PDF to remove the useless headers and footers through redaction annotations instead of relying on Calibre's regex-based Search and Replace function (tools like "mutool trim" only hide text; they don't actually remove it from the PDF). Don't expect miracles if the page layout in the PDF is anything more complicated than a single column

    Note: Calibre keeps basic text formatting (eg. italics) by default, while Abbyy FineReader requires enabling the "Retain fonts and font sizes" option in the File/Tools > Options > Format Settings > EPUB document type > Document layout > Formatted text item

    Since Calibre can have a hard time finding where chapters start, an easier solution is to 1) make a list of pages/ranges, one per chapter, 2) feed it to a slicer to split the source PDF into sub-PDFs, so that one chapter = one PDF (eg. cpdf.exe input.pdf 45-48 -o pdf_45-48.pdf), 3) run Calibre to convert them into EPUBs (ebook-convert.exe pdf_45-48.pdf pdf_45-48.epub --enable-heuristics --no-default-epub-cover), and 4) finally join them into a single EPUB (calibre-debug.exe --run-plugin EpubMerge -- -o full.epub --author "John Doe" --title "My book" --no-titles-in-toc --no-original-toc pdf_*.epub).

    Another solution is to edit the source PDF to add bookmarks and use those to slice the PDF into sub-PDFs:
    a. Add the bookmarks: cpdf.exe -add-bookmarks bookmarks.txt input.pdf -o input.BOOKMARKS.pdf
    b. Use cpdf to split the input file into multiple PDFs: cpdf.exe -split-bookmarks 1 -utf8 input.BOOKMARKS.pdf -o out%%%.pdf
    c. Remove bookmarks from all PDFs to prevent Calibre from appending "Document Outline" sections: cpdf -remove-bookmarks out003.pdf -o out003.NO.BOOKMARKS.pdf
    d. Run Calibre to turn each PDF into an EPUB: "C:\Program Files\Calibre2\ebook-convert.exe" "out001.pdf" "out001.epub" --enable-heuristics --no-default-epub-cover
    e. Finally, use Calibre's EpubMerge plug-in to join them into a single EPUB file: "C:\Program Files\Calibre2\calibre-debug.exe" --run-plugin EpubMerge -- --author "John Doe" --title "My great title" --no-titles-in-toc --no-original-toc -o full.epub out001.epub out002.epub

    Note: cpdf can prepend a table of contents; add the -toc-no-bookmark option to prevent it from adding bookmarks:
    cpdf -table-of-contents -toc-title "My Great ToC" -toc-no-bookmark input.pdf -o output.pdf

  7. Yet another alternative, since e-readers usually also support HTML, is to use MuPDF (or the deadware Mobipocket Creator) to turn the PDF into HTML pages. Use the following to get one HTML file per page instead of one huge HTML file that your e-reader might have a hard time handling (pics are embedded as base64): mutool draw -o %d.html in.pdf. You could also turn those web pages into an EPUB with pandoc.
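
The "one chapter = one PDF" route from step 6 lends itself to scripting. Here is a minimal Python sketch that only builds the three kinds of command lines quoted above (cpdf to slice, ebook-convert to convert, EpubMerge to join) from a list of page ranges; the function name and the sample ranges are made up for illustration, and the executables are assumed to be on the PATH (on Windows they end in .exe).

```python
from typing import List, Tuple


def chapter_commands(pdf: str, chapters: List[Tuple[int, int]],
                     title: str, author: str) -> List[List[str]]:
    """Return one argument list per command: slice, convert, then merge."""
    commands = []
    epubs = []
    for start, end in chapters:
        stem = f"pdf_{start}-{end}"
        # Slice the chapter's pages out of the source PDF.
        commands.append(["cpdf", pdf, f"{start}-{end}", "-o", f"{stem}.pdf"])
        # Convert that chapter into its own EPUB.
        commands.append(["ebook-convert", f"{stem}.pdf", f"{stem}.epub",
                         "--enable-heuristics", "--no-default-epub-cover"])
        epubs.append(f"{stem}.epub")
    # Join the chapter EPUBs, listed explicitly so they stay in chapter order.
    commands.append(["calibre-debug", "--run-plugin", "EpubMerge", "--",
                     "-o", "full.epub", "--author", author, "--title", title,
                     "--no-titles-in-toc", "--no-original-toc", *epubs])
    return commands


for command in chapter_commands("input.pdf", [(1, 44), (45, 48)],
                                "My book", "John Doe"):
    print(" ".join(command))
```

Once the printed commands look right, each list can be handed to subprocess.run(command, check=True) to actually run it.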

What are EPUBs?

An EPUB is actually a zip file that packs files in HTML, PNG, etc. Just rename the extension from EPUB to ZIP to check it out.

Information from "EPUB and KindleGen Tutorial"

http://bbebooksthailand.com/bb-epub-kindlegen-tutorial.html
 
The reason an EPUB consists of multiple HTML files is that "eReading devices are not known to be the fastest parsers of HTML due to their limited processing power. If your eBook is one large source file, it will cause serious lag and readability issues when a reader tries to open your eBook. […] You want to make sure that your HTML files are less than about 300KB each. You can use the exact same HTML Head Section for each file.
 
eBooks actually have two separate Table of Contents: an NCX (or Meta) Table of Contents and an HTML Table of Contents. Different eReading devices utilize these two Tables of Contents in different ways.
 
Inside the EPUB package are the following files:
The HTML content of your eBook (required)
An XML file called toc.ncx which is the NCX Table of Contents (required)
An XML file called content.opf which contains exactly how the EPUB is structured, what files are in the EPUB package, and the eBooks relevant metadata (required)
An XML file called container.xml which tells the eReader where the content.opf file is located in the compressed directory structure (required)
A text file called mimetype which says that the EPUB file is an EPUB and ZIP file (required)
The cover and content images (optional)
Audio, Video, Fonts and other media (optional)
One or more CSS files (optional)
 
Important Note: All of these files are case-sensitive, which may seem unusual for Windows users. So, be careful when you are building your EPUB package.
 
Please note that there is a very specific way that you have to compress the files into the EPUB format. Unfortunately for Windows users, compressing all your files using a GUI-based compression tool like 7-Zip will cause your EPUB file to fail validation. Per the IDPF specification, it is necessary to have the mimetype file added first to the zip file, and also to have it “stored” (i.e. uncompressed). That is why you have to use the command line to build your EPUB."
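
The packaging rule quoted above (mimetype first, and stored rather than compressed) does not have to be done by hand on the command line. Here is a minimal Python sketch using the standard zipfile module; the function names are made up for illustration, and the OPF passed in the usage line is a placeholder, so the result shows the container layout only and would not pass validation as a real book.

```python
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(files: dict) -> bytes:
    """Pack `files` (archive path -> bytes) into an EPUB-style zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry goes in first and is stored, not compressed.
        archive.writestr(zipfile.ZipInfo("mimetype"), b"application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        # Everything else can be compressed normally.
        for name, data in files.items():
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


def mimetype_is_valid(epub: bytes) -> bool:
    """Check that mimetype is the first entry, stored, with the right content."""
    with zipfile.ZipFile(io.BytesIO(epub)) as archive:
        first = archive.infolist()[0]
        return (first.filename == "mimetype"
                and first.compress_type == zipfile.ZIP_STORED
                and archive.read("mimetype") == b"application/epub+zip")


epub = build_epub({
    "META-INF/container.xml": CONTAINER_XML.encode("utf-8"),
    "content.opf": b"<package/>",  # placeholder, not a real OPF
})
print(mimetype_is_valid(epub))
```

The same check can be pointed at any existing EPUB by reading the file's bytes and passing them to mimetype_is_valid().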

Why is converting from PDF so difficult?

"PDF is a page oriented format while EPUB is a reflowable format."

"The main problem here is that PDF is a page oriented format (it describes where to put glyphs on the page), while epub and mobi are both text-oriented formats (they leave it to the device to do the layout). So basically, you need to extract the text from the PDF, intelligently recognize the formatting, express this formatting in HTML, and then convert it to epub/mobi. By definition, this can't be "lightweight". And even "heavyweight" applications might give you bad results, without manual correction. – dirkt Jan 4 '18"

"Quite simply because there IS no "textual information" in a PDF document. A PDF document doesn't contain paragraphs, sentences, and words. All that it contains is drawing instructions of the form "draw this shape at these coordinates". A PDF document is essentially a series of instructions for drawing a picture on a sheet of paper. It's not a book." (Source)

"A PDF document is a software program containing instructions written in a restricted subset of the PostScript document description language, which is a full blown stack-based programming language. Extracting text from a PDF document is difficult because it is not stored in specific sections of the file, but scattered in difficult to predict ways among the instructions that generate the document layout." (Source)

"Trying to extract a properly formatted document from a PDF is akin to hoping to recover a full-sized image by "enhancing" a small thumbnail." (Source)

"there is no concept of text structure in a PDF file at all, no lines, no paragraphs, sentences, nothing. All there is in a PDF file is 'this text' and 'put it here on the page'.

The encoding used for the text may even be custom, and there may be no possible method (other than OCR) for determining the actual text content (eg the Unicode values).

Sentences don't even have to be contiguous." (Source: comp.lang.postscript)
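
To make the quotes above concrete, here is a toy Python sketch of what a converter has to do: take text fragments that are only known by their position on the page, and guess which ones belong on the same line. The fragment list is invented for illustration (real PDFs are far messier), and the grouping rule is a deliberately naive assumption, not what Poppler or MuPDF actually do.

```python
from collections import defaultdict

# Hypothetical "draw this text at x,y" instructions, in the order they
# appear in the file (not in reading order); y grows downward.
fragments = [
    (300, 100, "world"),
    (72, 120, "Second"),
    (72, 100, "Hello"),
    (140, 120, "line"),
]


def rebuild_lines(fragments, y_tolerance=2):
    """Bucket fragments by (roughly) equal y, then sort each bucket by x."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        rows[round(y / y_tolerance)].append((x, text))
    return [" ".join(text for _, text in sorted(row))
            for _, row in sorted(rows.items())]


print(rebuild_lines(fragments))
```

Even this tiny example needs a tolerance to decide what "same line" means; columns, tables, footnotes and insets each need further guesses, which is where conversions go wrong.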

You can learn by reading archives of the Calibre > Conversion forum, searching for "EPUB PDF" in the titles.

Note that a PDF/EPUB can look different on the computer using eg. SumatraPDF and on your e-reader.

PDF to EPUB

k2pdfopt ("Kindle 2 PDF Optimizer") is the first thing to try to make the PDF as readable as possible on a smaller e-reader. If the result isn't good enough, try Calibre to convert the PDF into EPUB.

Text PDF

If you have a recent version of MS Word, open the PDF and see if its PDF Reflow feature does the job well enough, before converting the resulting docx file into an EPUB using Calibre, Writer2ePub, Pandoc, etc.

Issues that must be manually fixed:

Text in its own layer can be easily read and saved in a text file, but italics, notes and other formatting will be lost. pdftotext is one of the applications available.

Calibre/Poppler don't usually do a very good job turning PDF into text. LibreOffice opens PDF in Draw, but is unable to export this to Writer. Softmaker TextMaker can't open PDFs.

An alternative, even though the PDF already contains a text layer, is to open the PDF in Abbyy FineReader, copy/paste into LibreOffice Writer or Sigil, and fix the remaining issues. A CLI might be available; if not, an AutoIt script is handy to automate the process.

gImageReader says "PDFs with text. These PDF files already contain text", and stops.

Try k2pdfopt

K2pdfopt ("Kindle 2 PDF Optimizer") is a cross-platform GUI/CLI application to "optimize the format of PDF (or DJVU) files for viewing on small (e.g. 6-inch) mobile reader and smartphone screens such as the Kindle's. The output from k2pdfopt is a new (optimized) PDF file." It relies on MuPDF to read PDFs, but can be configured to use Ghostscript instead. See "PDF Conversion Tips for e-readers". Unless the output looks weird, the only settings you need to set are the e-reader's width + height + DPI, and possibly page cropping to reduce useless margins. The k2pdfopt web site lists the commands it supports. If you don't like the odd-looking GUI, there's an alternative: k2pdfopt GUI.

Note: By default, k2pdfopt converts pages into bitmaps, even when the source file is native PDF (ie. text, not scanned text into bitmaps).

To convert a native (ie. not scanned) PDF: k2pdfopt -mode fw -ls- input.pdf ("fit width" removes the excess borders; -ls- prevents turning the document on its side). Use the "-p" switch to only include a subset of pages, eg. "-p 2,4-8,37"

If need be, you can increase the output margins with the -om command-line option, e.g. -om 0.2 will add 0.2 inches of padding around the output pages.

If a PDF looks OK on the computer but doesn't on the e-reader, an alternative is to convert the whole file into bitmaps using "k2pdfopt -mode fw -ls- -n- input.pdf". Obviously, the file will be much bigger than the original.

If a PDF contains scanned pages instead of text, here's how to run it through k2pdfopt's embedded OCR program and include searchable text along with the bitmaps: k2pdfopt -mode copy -odpi 200 -ocr t -ocrlang fra -ocrd p input.pdf (set -ocrlang to the document's language; fra = French). In case of a multi-column layout, you can use the usual CTRL+mouse to select part of the screen.

Notes taken from the web site:

K2pdfopt converts each page of the input file to a bitmap, scans the bitmap for viewable areas (rectangular regions), cuts + crops these regions and assembles them into multiple smaller pages without excess margins so that the viewing region is maximized. Making use of this method, k2pdfopt can re-flow text lines, even on scanned documents.
 
As of v1.50, k2pdfopt will also embed OCR text into the PDF so that text can be searched and highlighted, and v1.60 can create output files with the native PDF instructions from the source file (if the source file is PDF).
 
K2pdfopt has the advantage over other PDF converters in that it fully preserves the rendered PDF fonts and graphics from the original file, unlike programs that convert the PDF to an e-book format. Also, because k2pdfopt is completely independent of language or fonts, it will work equally well on documents in any language.
 
MS Office offers PDF Reflow, where MS Word converts PDF files to Word documents amazingly well. Once you have your PDF file in MS Word format, you'll have a lot more capability to manipulate it into other formats and/or form factors.
 
With the default conversion, which allows text re-flow, every converted page is a bitmap, so the file size of the converted file is often larger than the original; however, many e-readers can process PDF files made up of bitmaps faster and with less memory overhead than the original PDF file, so you might still prefer this type of conversion. If you still want a smaller output file size, see my help page on output file size for options that reduce the output file size, mostly at the expense of the output quality. If you don't need text re-flow, you might try using a mode which converts using native PDF output.
 
To remove the excess borders on my PDF file, use "-mode fw" (fw = fit width). If you still want to rasterize the output, use -mode fw -n-. If you don't want to turn the document on its side, use -mode fw -ls-.
 
To crop region and put only that region in the output PDF à la Briss, use the GUI: Make one of the "Crop Areas" active (check box); type in the applicable page range for the crop box (e.g. 2-99), then click the blue Select button and choose your crop region. For the conversion mode, select Crop (command-line: -mode crop).
 
The reason a native PDF output can cause the device to run out of memory, be very slow, or even crash, is likely because of too many cropped-and-scaled regions in the output file. Try using a specific conversion mode instead. Modes are shorthand for setting a collection of options that are best suited for a specific type of optimization.
 
If there are more than one cropped/scaled regions on an output page, most PDF reading applications will get confused and allow selection of "invisible" text which is outside cropped regions and which overlaps with displayed text.
To see how k2 interprets a PDF file, try using the -sm command-line option ("sm" from the interactive menu), which will write out a PDF file that shows the regions found by k2pdfopt.
 
As of v1.35, k2pdfopt has a nice debugging option to clearly show you how it is interpreting your PDF file by marking the regions on it in the order it chooses to display them. The command-line option -sm (show markings) does this, or you can select "sm" from the interactive menu. This will generate a file name ending in "_marked.pdf".
 
To use text re-flow, even with tables / equations / figures, try protecting those regions by drawing boxes around them.
To prevent images / figures from being split across pages, use -f2p -1, or select "bp" from the interactive menu and enter -1 for the "fit-to-page" value.
 
To remove the document headers, footers, page numbers and/or other marks near the edges of the source pages, tell k2pdfopt to ignore an arbitrarily sized border around your document. See Ignoring Borders/Headers/Footers.
 
k2 allows for searching / highlighting the text in the converted PDF file because it has OCR capability, and as of v1.60, k2pdfopt has options for native PDF output, much like Cut2Col, SoPDF, and the latest version of PaperCrop.
 
NATIVE PDF OUTPUT = zoomable, and searchable like the original with no need for hidden OCR page.
 
The defaults for the kindle are 560 x 735. Even though the kindle screen is technically 600 x 800, the useable space for PDF files is 560 x 735. The other factor that affects the size and quality of the text on the display is the DPI.
 
While k2pdfopt is designed to give good results on a 6-inch reader by default, you may want to fine tune the DPI settings depending on your reader and your input file. The -idpi and -odpi settings, discussed above, control the quality (-idpi) and magnification (-odpi) of the k2pdfopt output PDF file.
 
Landscape mode (use -ls from the command line or select option (l) from the interactive settings menu) can be used to increase the text magnification at the expense of having more pages.
 
If you would like to reduce the output PDF file size, you can use the -bpc option to reduce the number of bits per color plane. The default is 4 (for 16 graylevels--the same as the kindle can display), but using -bpc 2 will reduce to 4 graylevels and reduce the PDF file size to approximately half.
 
If you want a little extra space around the text on your reading device, you can use the -om option to set the output margins (or select option (om) from the interactive settings menu in v1.16+).
 
Since v1.50, k2pdfopt can use one of two OCR engines to convert bitmapped text to native ASCII characters so that the text in the output file can be searched or copied and pasted into other applications. And in v1.63, bitmapped text from any language that Tesseract supports (including, for example, Chinese) is converted to Unicode-16 values and can be copied and pasted into Unicode-aware applications (e.g. most web browsers and modern word processing software). See the examples below.
 
Make sure you really need to perform OCR first. With k2pdfopt v2.x, if the source PDF document has searchable or highlightable text (e.g. if it is computer-generated or scanned but has an OCR layer), then k2pdfopt output of either type (native PDF or the default re-flowed text mode) should also have searchable text without having to resort to time-consuming OCR. OCR should only be necessary if the source document is scanned and does not already have a text/OCR layer.
 
Use the -m option (or select option (m) from the interactive settings menu in v1.16+) to tell k2pdfopt to ignore a certain amount of margin in the input file. For this particular example, 0.8 inches is a good value, so -m 0.8 should be used.
 
K2pdfopt has built-in PDF translation (via the MuPDF library) but will try to use Ghostscript if Ghostscript is available and the internal (MuPDF) translation fails. Since I fixed a couple bugs with MuPDF in v1.16, I have found no instances where MuPDF fails to correctly translate a PDF file, but you can force Ghostscript to be used with the -gs option.
 
Forum: https://www.mobileread.com/forums/showthread.php?t=144711
 
GETTING STARTED WITH THE WINDOWS GUI https://www.willus.com/k2pdfopt/help/overview.shtml
INTERACTIVE TEXT MENU https://www.willus.com/k2pdfopt/help/textmenu.shtml
LIST OF K2PDFOPT COMMAND-LINE OPTIONS https://www.willus.com/k2pdfopt/help/options.shtml

Tips

List of options.

"The modes are really just shortcuts that combine multiple individual options that, together, are well suited for a particular type of conversion. You can then tailor things further, if desired, by adding more options after the -mode command."

"If the entire source page fits your device when you strip away the margins, try -mode trim. This will trim away any margin areas around the text and fit it to your device screen to maximize the size of the text."

"If the width of the source material fits either the width or the height of your device and is comfortably readable, try -mode fitwidth or -mode fw. If you don't want the output in landscape, add -ls- to force portrait output.
If there is a common area on every page that you want to select which will then comfortably fit your device screen, you can use -cbox to specify this region, or use the MS Windows GUI to graphically select the region (see the "Crop Areas" part of the GUI). If the entire selected area fits onto your device with no trimming or text re-flow required to be readable, use -mode crop."

To remove excess borders: -mode fw (fw = fit width)

To set the device model: -dev kbg (for Kobo Glo)

To set the page height and width: -w 758 -h 1024

If the text only has a single column: -col 1

To set the magnification: -dpi 213

-m* are used on the input, while -om* are used on the output

To ignore headers/footers/borders: -m* or -cbox (Important: As usual, cropped data is only hidden, not removed from the PDF). The -ml, -mr, -mb, and -mt options can also be used to more specifically set the left, right, bottom, and top margin-ignoring widths, respectively.

To add some extra space around the text, use the output margin option: -om 0.3 (or -oml, -omr, -omb, and -omt) https://www.willus.com/k2pdfopt/help/margins.shtml
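
The tips above can be combined into a single call. Here is a minimal Python sketch that assembles a k2pdfopt command line from the e-reader's width, height and DPI; it only uses options mentioned in these notes (-mode fw, -ls-, -w, -h, -dpi, -p, -om), the function name is made up, and the values shown are the ones used as examples above rather than recommendations for any particular device.

```python
from typing import List, Optional


def k2pdfopt_command(pdf: str, width: int, height: int, dpi: int,
                     pages: Optional[str] = None,
                     output_margin: Optional[float] = None) -> List[str]:
    """Build a 'fit width, portrait' k2pdfopt command for a given screen."""
    command = ["k2pdfopt", "-mode", "fw", "-ls-",
               "-w", str(width), "-h", str(height), "-dpi", str(dpi)]
    if pages:
        # Only convert a subset of pages, eg. "2,4-8,37".
        command += ["-p", pages]
    if output_margin is not None:
        # Extra padding (in inches) around the output pages.
        command += ["-om", str(output_margin)]
    command.append(pdf)
    return command


print(" ".join(k2pdfopt_command("input.pdf", 758, 1024, 213,
                                pages="2,4-8,37", output_margin=0.3)))
```

Trying the command on a handful of pages first (through the pages argument) is a quick way to check the settings before converting a whole book.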

Using an OCR

OCRing + EPUBing my first book: Tips?

https://www.mobileread.com/forums/showthread.php?t=331376

OCR: gImageReader (GUI to Tesseract), Abbyy FineReader

EPUB editor: LibreOffice Writer, Sigil (last Win32 release: 0.9.14; How to compile)

Help

Q&A

"native PDF output"?

"rather than rendering the output file as a sequence of bitmaps, each output page is rendered directly using the source PDF file instructions, but with translation, scaling, and cropping directives to place the source regions at the appropriate places on the output pages"

In the GUI, how does "native PDF output" differ from "Re-flow text"?

Native =  -n -wrap-, Re-flow text = -wrap+

"can re-flow text even on scanned PDF files"

How does it move text in a scanned page?

My e-reader isn't listed

Use the -w (width) and -h (height) command-line options.

"text re-flow"?

-wrap vs. -wrap+?

"rasterize"?

Turn native text into a bitmap

Why is it hard converting PDF to text (eg. EPUB)?

How does the device setting ("Kobo Glo, Kindle 1-5, etc.") change?

What toolkit was used to write the GUI?

"native/bitmapped PDF"?

How to remove page numbers displayed in the middle of a page?

Per this tip, in the GUI, try adding the following to the "Additional options" section, and run a test on just one page where the problem occurs: -m 0.25,0.25,0.25,0.7

Using Calibre

  1. Open the PDF in a viewer (on Windows, SumatraPDF can read PDF and EPUB), and make a list of the pages that include anything more sophisticated than plain text:
    1. Text that is displayed in multiple columns must be turned into one column
    2. Insets must be removed, and turned into regular text
    3. Tables: Rather than trying to rewrite it as HTML, it's easier to just take a screenshot and save it as JPG/PNG to be inserted in the EPUB later; Make sure the picture is no bigger than the width+height of your e-reader, and that the picture is located in the HTML file at the top+left so that it's correctly displayed
  2. Use Calibre to generate the EPUB; If need be, play with its settings in the Conversion dialog, including the Page setup where you can tell Calibre which e-reader you have
  3. Open the output in its editor (right click > Edit book, or T), and edit the pages that need it; Pages can be removed through the Delete key, and new ones added with File > Insert; To insert an image, use the familiar <img src=""> sequence
  4. Copy EPUB to e-reader.

An easy way to fix issues with the EPUB created by Calibre is to edit the file in Sigil.

Build a hybrid PDF

As an alternative to turning a PDF into EPUB with Poppler and all its issues, there's the option of simply converting the few problematic pages (tables, etc.) into pictures, replacing+merging them back into the main PDF, and reading the PDF on my e-reader that has no problem handling basic text.
Obviously, while flipping through that kind of mixed PDF, the user can tell the difference, but IMHO it's a much better solution than the HTML output from Poppler.

k2 is unable to run once and handle pages differently, turning some pages into bitmaps (rasterize) while leaving the others as text ("native PDF"): You'd have to write a loop, and merge those two sets back into a PDF. Likewise, I haven't found how to use cpdf to crop, maximize, and rasterize pages.

Things that could be improved:

To crop:

Here's the Windows batch script:

@ECHO OFF

REM myscript.bat output.pdf input.pdf "1-5,8,25"
 
REM Note: ~ removes quotes
if "%~1"=="" GOTO PARAM
if "%~2"=="" GOTO PARAM
if "%~3"=="" GOTO PARAM
 
REM Change those to match your e-reader
SET DPI=213
SET WIDTH=758
SET HEIGHT=1024
 
IF NOT EXIST mutool.exe (ECHO mutool missing & GOTO END)
SET APP=..\mutool.exe
REM ~ strips surrounding quotes so the paths below are built correctly
SET OUTPUT=%~1
SET INPUT=..\%~2
SET LIST=%~3
SET TMPDIR=TEMP%random%%random%%random%%random%%random%%random%TEMP
 
REM Create temp dir
IF NOT EXIST %TMPDIR% MD %TMPDIR%
CD %TMPDIR%
 
REM Convert input PDF into individual PDFs
FOR /F "tokens=* delims=" %%# IN ('%APP% show %INPUT% Root.Pages.Count') DO SET "COUNT=%%#"
ECHO Found %COUNT% pages
FOR /L %%i IN (1,1,%COUNT%) DO (ECHO Handling %%i & %APP% clean -g %INPUT% %%i.pdf %%i)
 
REM Convert required pages into PNG, and remove matching PDF
%APP% draw -r %DPI% -w %WIDTH% -h %HEIGHT% -o %%d.png %INPUT% %LIST%
REM Delete matching PDFs
FOR %%A in (*.png) DO (ECHO Deleting %%~nA.pdf & DEL %%~nA.pdf)
 
REM Convert PNG files into PDFs, and remove PNG
FOR %%A IN (*.PNG) DO (%APP% convert -O compress -F pdf -o %%~nA.pdf %%A & ECHO Deleting  %%A & DEL %%A)
 
REM Merge individual PDFs into single PDF
REM Build list
SETLOCAL EnableDelayedExpansion
SET _filelist=
REM Build the list in numeric page order ("dir /b" would sort 10.pdf before 2.pdf)
FOR /L %%i IN (1,1,%COUNT%) DO (
  SET "_filelist=!_filelist!%%i.pdf "
)
SET LIST=%_filelist:,,=%
REM ECHO LIST=%LIST%
ECHO Merging
%APP% merge -o %OUTPUT% -O compress %LIST%
 
REM Cleaning up
MOVE %OUTPUT% ..
CD ..
RMDIR /S /Q %TMPDIR%
GOTO END
 
:PARAM
ECHO Usage : %0 output.pdf input.pdf "pages" (Use quotes if pattern includes commas, eg. "2,3")
GOTO END
 
:END
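
The same logic can be written in a cross-platform language. Here is a minimal Python sketch of it that only builds the list of mutool commands (a dry run) instead of running them: the mutool sub-commands and switches are copied from the batch script above, while the function names are made up, and the page count is passed in by hand where the batch script gets it from "mutool show".

```python
def expand_pages(spec: str) -> list:
    """Turn a page list such as "1-5,8,25" into [1, 2, 3, 4, 5, 8, 25]."""
    pages = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            pages.extend(range(int(first), int(last) + 1))
        else:
            pages.append(int(part))
    return pages


def hybrid_plan(input_pdf, output_pdf, page_count, spec,
                dpi=213, width=758, height=1024):
    """Return the mutool commands that build a text + picture PDF."""
    to_rasterize = set(expand_pages(spec))
    plan = []
    for n in range(1, page_count + 1):
        if n in to_rasterize:
            # Problem page: render it to PNG, then wrap the PNG in a PDF.
            plan.append(["mutool", "draw", "-r", str(dpi), "-w", str(width),
                         "-h", str(height), "-o", f"{n}.png", input_pdf, str(n)])
            plan.append(["mutool", "convert", "-O", "compress", "-F", "pdf",
                         "-o", f"{n}.pdf", f"{n}.png"])
        else:
            # Plain page: extract it untouched.
            plan.append(["mutool", "clean", "-g", input_pdf, f"{n}.pdf", str(n)])
    # Merge the single-page PDFs back, in numeric page order.
    plan.append(["mutool", "merge", "-o", output_pdf, "-O", "compress"]
                + [f"{n}.pdf" for n in range(1, page_count + 1)])
    return plan


for command in hybrid_plan("input.pdf", "output.pdf", 3, "2"):
    print(" ".join(command))
```

To go beyond the dry run, each command can be passed to subprocess.run(command, check=True) from inside a temporary directory, as the batch script does.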

REWRITE AS POWERSHELL OR RUBY

https://en.wikipedia.org/wiki/PowerShell

cmd.exe > powershell

$PSVersionTable

Editing the PDF

An alternative is to use LibreOffice Draw to modify the PDF, and read it in your e-reader without bothering with EPUB:

  1. If the PDF file is big and would make LibreOffice sluggish, use qpdf to export each page as an individual PDF file:

    qpdf --progress --split-pages infile.pdf %d.pdf
     
  2. In Draw, open and edit each problematic page to replace all nasty parts (remove/rewrite insets and multi-column text, replace tables with screenshots)
  3. Use qpdf to merge all the pages back into a single PDF:

    qpdf --empty --pages *.pdf -- out.pdf

A faster way is to simply turn each "problematic" page into pictures:

  1. Open PDF on computer, and make a list of the pages that contain anything more than basic, one-column text (eg. multi-columns, tables, insets, etc.)
  2. Split all the pages of the PDF into individual files

    pdfseparate.exe input.pdf %d.pdf
     
  3. Convert each PDF with difficult layout into pictures, matching the e-reader's width+height
  4. Merge all the PDFs back into a single PDF
  5. Send to e-reader, and test.
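
One thing to watch for when merging the pages back: files named 1.pdf, 2.pdf, ... 10.pdf do not sort in page order when listed alphabetically. Here is a small Python sketch of a "natural" sort key that fixes this before the list is handed to the merge tool; the helper name is made up, and the file names are examples.

```python
import re


def natural_key(name: str):
    """Split '10.pdf' into ['', 10, '.pdf'] so that numbers compare as numbers."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


files = ["10.pdf", "2.pdf", "1.pdf"]
print(sorted(files))                   # ['1.pdf', '10.pdf', '2.pdf']
print(sorted(files, key=natural_key))  # ['1.pdf', '2.pdf', '10.pdf']
```

The sorted list can then be spliced into the merge command, eg. ["qpdf", "--empty", "--pages", *pages, "--", "out.pdf"], instead of relying on *.pdf.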

Q&A

Nolim: Supported file formats for books: epub, fb2, html, txt, pdf, and Adobe DRM

Can qpdf convert PDF into pictures?

No.

Can MuPDF convert PDF into pictures, and merge the files back?

for N in $(seq $(mutool show input.pdf Root.Pages.Count)); do mutool clean -g input.pdf page$N.pdf $N; done

convert relevant pages into pictures

mutool merge

Investigate ImageMagick's convert

convert in.pdf -crop 50%x0 +repage out.pdf

Try TIFF or PSD vs. JPG/PNG
How to get rid of "side circles" (typographic signs)?

d:\Temp\PDF.to.EPUB\test.PDF.edit\10.jpg-1.jpg

pdftocairo vs. pdftoppm?
How to crop?
pdfseparate.exe: progress bar?
How to compile Poppler for Windows32?

https://sourceforge.net/projects/poppler-win32/

https://www.anaconda.org/conda-forge/poppler/files

https://blog.alivate.com.au/tag/pdftohtml/

https://towardsdatascience.com/poppler-on-windows-179af0e50150 

Converting HTML pages into an EPUB

Multiple HTML files can be cleaned up and converted into a single EPUB file.

  1. With Calibre, you first need to create a ToC
  2. And then, call the command: "c:\Program Files\Calibre2\ebook-convert.exe" ToC.html full.epub
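
Writing the ToC by hand gets tedious with many pages. Here is a minimal Python sketch that generates a ToC.html linking to each chapter file, on the assumption that Calibre follows the links in the order they appear; the function name, file names and titles are placeholders.

```python
import html


def build_toc(entries, title="Table of Contents"):
    """entries: list of (file name, chapter title) pairs, in reading order."""
    items = "\n".join(
        f'    <li><a href="{html.escape(href, quote=True)}">'
        f"{html.escape(label)}</a></li>"
        for href, label in entries
    )
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f'  <meta charset="utf-8">\n  <title>{html.escape(title)}</title>\n'
        "</head>\n<body>\n"
        f"  <h1>{html.escape(title)}</h1>\n  <ul>\n{items}\n  </ul>\n"
        "</body>\n</html>\n"
    )


toc = build_toc([("chapter1.html", "Chapter 1"), ("chapter2.html", "Chapter 2")])
with open("ToC.html", "w", encoding="utf-8") as handle:
    handle.write(toc)
```

The resulting ToC.html sits next to the chapter files and is the file passed to ebook-convert.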

Open-source applications

Calibre

Calibre is a well-known cross-platform, open-source GUI e-book manager written by Kovid Goyal (out of the box, it does not remove DRM). Among other things, it can convert a PDF into EPUB. It does a reasonable job at the latter, but just like other tools, it has a difficult time with more sophisticated layouts, tables, and headers/footnotes.

The UI can be changed through Preferences > Change Calibre behavior (CTRL+P) > Interface, and through the Layout button in the bottom right corner.

It's also available as a CLI. Once Calibre is installed, you can run the PDF-to-EPUB converter through the command line:

ebook-convert.exe input_file output_file [options]

To convert a PDF into HTML, Calibre actually relies on Poppler. What Calibre does is run Poppler's pdftohtml to convert each page of the PDF into HTML, and then work from there to build an EPUB. The settings in Calibre's Convert dialog let you change the settings that it will use for this operation, but Calibre can only do so much using the input from Poppler.

"Calibre is awesome at many things, but PDF conversion isn't one of its strong points. What I find most annoying is the text unwrapping, and that certainly is Calibre's fault. The algorithm it uses is quite simplistic, if a line is less than xx% of the page width, it's considered a paragraph break, if it's longer, it's not. So in a typical book, you end up with hundreds of incorrect paragraph breaks - spurious breaks that shouldn't be there, and paragraphs stuck together that shouldn't be." (Source)

How to work with the Convert dialog

The Debug option lets you see the files at the four steps: input, parsed, structure, and processed.

Slight editing can be done in the \input directory, which contains the HTML files generated by poppler. When you're done, zip the files up, add it through Edit meta information dialog, and proceed with the conversions. 

How to make the most of the Convert dialog

Start by converting the PDF to EPUB using Calibre's default settings, and see what the issues are. When clicking on the Wizard button in the Search & replace section, if an EPUB isn't available, Calibre will first convert the PDF into HTML, which explains the pause.

https://manual.calibre-ebook.com/conversion.html

Heuristic Processing
Line numbers
https://dearauthor.com/ebooks/calibre-pdfs-epub-conversion-tips/
Line Un-Wrapping Factor

Used to unwrap paragraphs. This is a scale used to determine the length at which a line should be unwrapped. Valid values are a decimal between 0 and 1. The default is 0.45, just under the median line length. Lower this value to include more text in the unwrapping. Increase to include less. You can adjust this value in the conversion settings under PDF Input.

You can set this lower than the default to make line unwrapping more 'aggressive', but be aware that doing this may unwrap lines which shouldn't be unwrapped.
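
To get a feel for what the factor does, here is a toy Python sketch of the kind of heuristic described in this section: a line shorter than factor x "page width" is treated as the end of a paragraph, anything longer is joined to the next line. This is a simplification for illustration only, not Calibre's actual code, and it uses the longest line as a stand-in for the page width (an assumption).

```python
def unwrap(lines, factor=0.45):
    """Join hard-wrapped lines into paragraphs using a length threshold."""
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if not lengths:
        return []
    threshold = factor * max(lengths)  # longest line ~ page width (assumption)
    paragraphs, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        current.append(text)
        if len(text) <= threshold:
            # Short line: assume the paragraph ends here.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    "This is the first line of a paragraph that",
    "wraps once.",
    "A new paragraph starts here and also wraps",
    "but ends soon.",
]
print(unwrap(sample))              # two paragraphs
print(unwrap(sample, factor=0.2))  # everything joined into one
```

With the lower factor, no line in the sample is short enough to count as a paragraph end, which is the "more aggressive" unwrapping the text warns about.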

"the unwrap function looks at the median (or average, can't remember) line length, and only unwraps lines that exceed that length. That works well for a book with consistent breaks in roughly the same location for every line (OCR, pdf, many well formatted text files), but it will fail where the hard breaks are inconsistent/infrequent. Reducing the unwrap factor basically tells Calibre to look for shorter lines than the median. The fewer or more erratic the breaks the lower you need to go, sometimes all the way down to 0.05". (Source)

Page setup

Page setup: Choose the output profile that matches the screen size of your e-reader

Structure detection
Table of contents
Search & replace

In the Search and replace section, use the wizard to test regexes

Regexes are applied to the HTML as produced by poppler. If an EPUB has already been generated, Calibre asks whether to use its HTML or to start again from the PDF.

Headers and footers must be found and removed: they end up in the middle of the converted text, and they can throw off the paragraph unwrapping.
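Before pasting a regex into the wizard, it can be tried out in plain Python. The sketch below targets a running line of the form "ffirs_simmons.qxd  5/16/05  4:13 PM  Page iii", as found in one test PDF; the pattern and the function name strip_running_lines are illustrative, and the real HTML will also contain tags around the text, so the pattern would need adjusting.

```python
import re

# Matches a whole line such as:
#   ffirs_simmons.qxd  5/16/05  4:13 PM  Page iii
# Only spaces and tabs are allowed between fields so a match never spans lines.
RUNNING_LINE = re.compile(
    r"^[ \t]*\S+\.qxd[ \t]+\d{1,2}/\d{1,2}/\d{2}[ \t]+"
    r"\d{1,2}:\d{2}[ \t]+[AP]M[ \t]+Page[ \t]+\S+[ \t]*\n?",
    re.MULTILINE,
)


def strip_running_lines(text):
    """Remove every line that looks like the typesetter's running footer."""
    return RUNNING_LINE.sub("", text)
```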

"If you are intimidated by regular expressions, many Windows users have reported that [deadware] Mobipocket Creator is a good alternative to use to do the initial pdf conversion. Use Mobipocket Creator to convert the pdf to the .mobi format, and then use Calibre to convert from mobi to your final desired format."

Q&A

Any way to tell Calibre to ignore some pages (ToC, tables, etc.)?

Open question: how does PDF Input > Line un-wrapping factor (0.45) differ from Heuristic processing > Line un-wrap factor (0.40)?

PDF input
EPUB output
Debug
Post-EPUB editing

If the EPUB output needs some work, you can use either Calibre's internal editor (select the book in the list > Edit Book) or Sigil (80MB). Sigil is much like the editor in Calibre without all the fuss; note that it must be restarted after changing the UI language.

Infos

Q&A

How to hide the left-side tree list Authors, Languages, etc.?

MuPDF (mutool)

Even worse than poppler at converting PDF to HTML (one line = one <p></p>).

The HTML displays fine in a browser, but is useless for creating an EPUB because it has no notion of lines and paragraphs: each line of text is just displayed at coordinates x,y, with no indication that it belongs to a paragraph.

Note: According to Calibre's author, mutool outputs "non-reflowable HTML, it is just as useless as the original PDF file."

https://www.mupdf.com/docs/manual-mutool-draw.html

mutool draw -F html -o out.%d.html in.pdf
one page = one HTML (pics embedded as base64)

mutool draw -F html -o out.html in.pdf
single HTML file, with pictures embedded as "data:image/png;base64"
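A single HTML file with every picture inlined gets big quickly. Here is a rough Python sketch that moves the embedded PNGs out into separate files. It assumes the images appear as src="data:image/png;base64,..." attributes with double quotes, which should be checked against the actual output; the function name externalize_images and the img0001.png naming scheme are made up.

```python
import base64
import re
from pathlib import Path

# Assumed form of the embedded pictures; adjust if the real HTML differs.
DATA_URI = re.compile(r'src="data:image/png;base64,([^"]+)"')


def externalize_images(html, out_dir, prefix="img"):
    """Write each embedded PNG to out_dir and point the src attribute at it.

    Returns the rewritten HTML and the number of images extracted.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0

    def save(match):
        nonlocal count
        count += 1
        name = f"{prefix}{count:04d}.png"
        (out_dir / name).write_bytes(base64.b64decode(match.group(1)))
        return f'src="{name}"'

    return DATA_URI.sub(save, html), count
```

The rewritten HTML should be saved in the same directory as the extracted pictures so the relative src paths resolve.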

Remove footer? ffirs_simmons.qxd  5/16/05  4:13 PM  Page iii

-> Copied HTML into e-reader: took about one minute to open, and… unreadable (formatting mangled).

https://artifex.com/support/open-source/

Note: In "mutool convert", N can be used to stand for the last page

clean vs. draw vs. convert:

Both mutool draw and mutool convert do the same thing, but the interfaces are different.

Functions offered by mutool:

Some examples:

mutool pages input.pdf 20

#trim introduced in 1.22

mutool trim -b mediabox -o cropped.pdf in.pdf

cpdf

https://github.com/coherentgraphics/cpdf-binaries
https://github.com/coherentgraphics/cpdfsqueeze-binaries

Written by Coherent Graphics Ltd's John Whitington, author of O'Reilly's "PDF Explained". Based on an open source library written in OCaml.

Notes from cpdfmanual.pdf:

The cpdf tool has been available commercially since 2007, and is widely used in industry and government. Now we're releasing two tools for free, the main program under a special not-for-commercial-use license, and a lossless PDF squeezer under the LGPL.

When measurements are given to cpdf, they are in points (1 point = 1/72 inch). They may optionally be followed by some letters to change the measurement. The following are supported: pt (points, 72 per inch; the default), cm (centimeters), mm (millimeters), in (inches).
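Since box coordinates end up in points anyway, a small converter helps when working from a page size in millimeters or inches. This is just the arithmetic from the note above written in Python; the function name to_points is made up and is not part of cpdf.

```python
# Points per unit, from 1 point = 1/72 inch and 1 inch = 2.54 cm.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "in": 72.0,
    "cm": 72.0 / 2.54,
    "mm": 72.0 / 25.4,
}


def to_points(measurement):
    """Convert a string such as '424pt', '15cm' or '600' to points."""
    text = measurement.strip()
    for suffix, factor in POINTS_PER_UNIT.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    # No suffix: the value is already in points.
    return float(text)
```

For example, an A5 page (148mm x 210mm) works out to roughly 420 x 595 points.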
 
Linearized PDF is a version of the PDF format in which the data is held in a special manner to allow content to be fetched only when needed. This means viewing a multipage PDF over a slow connection is more responsive. This requires the existence of the external program cpdflin which is provided with commercial versions of cpdf.

Functions offered by cpdf:

A couple of examples:

cpdf -page-info input.pdf 25

cpdf -mediabox "0 0 424pt 600pt" input.pdf 1,25-50 -o output.pdf

pdfCropMargins

"The pdfCropMargins program is a command-line application to automatically crop the margins of PDF files."

pdfcpu

"pdfcpu is a PDF processing library written in Go supporting encryption. It provides both an API and a CLI."

Examples:

pdfcpu boxes add -- "media:[0 0 200 600]" input.pdf output.pdf

pdfcpu boxes list -p 20 output.pdf

qpdf

Notes from qpdf-manual.pdf

qpdf does structural, content-preserving transformations on PDF files.
 
In QDF mode, qpdf creates PDF files in what we call QDF form. The purpose of QDF form is to make it possible to edit PDF files, with some restrictions, in an ordinary text editor.
 
A Python module called pikepdf [https://pypi.org/project/pikepdf/] provides a clean and highly functional set of Python bindings to the qpdf library. Using pikepdf, you can work with PDF files in a natural way and combine qpdf's capabilities with other functionality provided by Python's rich standard library and available modules.
 
The qpdf command-line program can produce a JSON representation of the non-content data in a PDF file. It includes a dump in JSON format of all objects in the PDF file excluding the content of streams. This JSON representation makes it very easy to look in detail at the structure of a given PDF file.

Functions offered by qpdf:

poppler

Like the original pdftohtml, poppler is based on xpdf. Confusingly, poppler kept the application names such as "pdftohtml", so it's hard to tell that its pdftohtml is not the original, whose development was abandoned in 2006.

As of April 2020, the latest stable release is poppler-0.87.0.tar.xz, released on March 28, 2020. Note that packages for Ubuntu et al. might be out of date.

Poppler includes multiple applications:

https://blog.alivate.com.au/tag/pdftohtml/

https://towardsdatascience.com/poppler-on-windows-179af0e50150

apt-get install poppler-utils

pdftohtml

"-c : This will output in complex mode. You can't use -noframes with the complex flag."

"-noframes   generate no frames. Not supported in complex output mode."

"complex mode": One page = one HTML file + one PNG that only includes some typographical feature to center the output.

Here's how to convert a PDF that was encoded in Latin1: pdftohtml -c -s -enc Latin1 test.pdf test.html

Windows release

https://stackoverflow.com/questions/18381713/how-to-install-poppler-on-windows

http://blog.alivate.com.au/poppler-windows/
(Last Windows release is 0.68 while the current release is 0.87 released on March 28, 2020)

https://anaconda.org/conda-forge/poppler/files
(Up to date but only available for Win64)

d:\Temp\temp.Archie.SVG\inkscape\libpoppler-73.dll
d:\Temp\temp.Archie.SVG\inkscape\libpoppler-glib-8.dll 

! Source Win32 https://github.com/zotero/cross-poppler "cross-poppler compiles Poppler PDF tools for macOS (x64), Windows (x86, x64), Linux (x86, x64). This is only intended to be used for pdfinfo and pdftotext."

poppler-0.39.0-win32.zip        2016-01-07      7.3 MB https://sourceforge.net/projects/poppler-win32/

pdfium

podofo

PDFMasher

"PDFMasher, now long abandoned and unmaintained."

pdftk

Written by Sid Steward, author of O'Reilly's "PDF Hacks". For some reason, the CLI binary is called "PDFtk Server".

To uncompress a PDF: pdftk input.pdf output output.pdf uncompress

apt-get install pdftk ghostscript

https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

https://en.wikipedia.org/wiki/PDFtk

pdfminer.six

PDFMiner

pdftohtml

pdf2htmlEX

Pandoc

pandoc cannot convert PDF to HTML, but can turn HTML into EPUB. Written in Haskell, it's rather slow and resource-hungry so isn't great on big files.

Incidentally, here are two commands you can use to download a single web page and its dependencies, and turn it into an EPUB:

wget -E -H -k -K -p http://www.acme.com/mypage.html
pandoc -f html -t epub -o output.epub mypage.html

pandoc can also fetch web pages directly: pandoc -f html -t epub -o output.epub https://www.fsf.org

To install on Linux: apt-get install pandoc

Notes:

You can add metadata for epub:

pandoc -f html -t epub3 --epub-metadata=metadata.xml -o output.epub input.html
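The metadata file is expected to contain a series of Dublin Core elements, one per line, such as <dc:rights> or <dc:language>. Here is a small Python sketch to produce one without hand-escaping special characters; the function name dublin_core_fragment and the sample values are only for illustration.

```python
from xml.sax.saxutils import escape


def dublin_core_fragment(fields):
    """Return one <dc:...> element per entry, with &, < and > escaped."""
    lines = [
        f"<dc:{name}>{escape(value)}</dc:{name}>"
        for name, value in fields.items()
    ]
    return "\n".join(lines) + "\n"


def write_metadata(path, fields):
    """Write the fragment to a file suitable for --epub-metadata."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(dublin_core_fragment(fields))
```

Calling write_metadata("metadata.xml", {"rights": "Creative Commons", "language": "en-US"}) creates the file used in the command above.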

"Alternatively, you could use pdftotext, save it to text, edit it into shape as well formatted markdown, and then use pandoc to convert it to epub.....i've done that several times - after a lot of practice (and some handy vim key mappings), it takes me about a day or so to convert a book with a few hundred pages."

Note: "pandoc-citeproc originated as a fork of Andrea Rossato's citeproc-hs. The pandoc-citeproc executable can be used as a filter with pandoc to resolve and format citations using a bibliography file and a CSL stylesheet."

pandoc -o output.epub  *.html
pandoc: *.html: openBinaryFile: invalid argument (Invalid argument) 

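This happens because cmd.exe, unlike Unix shells, does not expand wildcards: pandoc receives the literal string *.html. As a workaround, save this batch file as pandoc.cmd: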

@echo off
:: Pandoc wrapper for calling it with wildcard file parameters.
:: Expands any arguments containing wildcards according to standard
:: Windows CMD.exe conventions.
setlocal EnableDelayedExpansion
set pandoc_cmd=pandoc
for %%I in (%*) do set pandoc_cmd=!pandoc_cmd! "%%~I"
!pandoc_cmd!
endlocal

Call it thusly: pandoc.cmd -o output.epub *.html

If it still fails, use PowerShell instead of cmd.exe, or write a script in a richer language such as Python.
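As a possible Python version of that wrapper, the sketch below expands the wildcards itself and hands pandoc an explicit file list. It assumes pandoc is on the PATH; the function names build_pandoc_command and convert are made up.

```python
import glob
import subprocess


def build_pandoc_command(output, patterns):
    """Expand wildcard patterns and return the pandoc argument list."""
    inputs = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Keep the argument as-is when nothing matches, so pandoc
        # reports the missing file itself.
        inputs.extend(matches or [pattern])
    return ["pandoc", "-o", output, *inputs]


def convert(output, patterns):
    """Run pandoc on the expanded file list (raises if pandoc fails)."""
    return subprocess.run(build_pandoc_command(output, patterns), check=True)
```

For example, convert("output.epub", ["*.html"]). Files are passed in alphabetical order, so the HTML files should be named so that they sort in reading order (01.html, 02.html, ...).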

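Do NOT concatenate the HTML files first:

copy /b *.html full.html
pandoc -o full.epub full.html

The result is several complete HTML documents glued end to end. It's a better idea to keep individual HTML files, and merge them into a single EPUB.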

EPUBCheck

https://github.com/w3c/epubcheck

"The EpubCheck tool is an open-source program written in Java that checks your EPUB file for errors. Most eBook stores that utilize the EPUB format will utilize this exact same program to see if the eBook you upload for sale is valid." (Source)

Closed-source solutions

Multidoc Converter

http://multidoc-converter.com

Tried 1.6.0.0 with a 10MB single HTML file with all pictures embedded as base64: as displayed in SumatraPDF, the result is as crappy as Calibre's.

PDFMate (free/pro)

PDFelement (Pro)

PDFelement Standard perpetual license $79

iSkysoft PDF Editor is PDFelement under a different name.

Xilisoft PDF to EPUB Converter

Xilisoft PDF to EPUB Converter $20

ABBYY FineReader

FineReader: converts PDFs to the e-book formats EPUB and FB2 (Standard and Corporate editions). 199€; release 14 is ~500-850MB.

Solid Converter

Solid Converter: PDF to EPUB? $100

A-PDF

A-PDF no converter?

MobiPocket Creator

Deadware from Mobipocket. The final release, 4.2, can be found in the "Tutorial - How to Create a MobiPocket eBook" thread.

"Home Edition has a simple to use interface and is designed to produce content for private use. When creating new files from scratch you can use predefined templates to aid in the creation effort. A user can also use the windows version of MobiPocket Reader to convert files.

Installing [the Publisher Edition] provides the most power to customize the output of the file and is required to submit eBooks commercially. if you are a publisher and intend to sell eBooks through eBookbase, this is the version of the Mobipocket Creator that you should use. Additional features essential for publishers include: the encryption level required by eBookbase; an integrated "deploy" feature to automatically upload or update your books in eBookbase; the metadata editor to set the price, ISBN, cover image... of your books; PDF import".

https://www.mobileread.com/forums/showthread.php?t=17914

https://wiki.mobileread.com/wiki/MobiPocket_Creator

Installed the Publisher Edition. After it reads a PDF, it generates an HTML file and pictures. "Build" creates a .PRC file, which you don't need. Once you have the HTML file, doctor it in the Calibre editor or Sigil, before turning it into EPUB.

From a PDF, creates a single HTML + multiple PNGs.

Tools to crop PDF

Q&A

Mediabox, cropbox, bleedbox, trimbox, artbox?

https://wiki.scribus.net/canvas/PDF_Boxes_:_mediabox,_cropbox,_bleedbox,_trimbox,_artbox

https://pdfcpu.io/getting_started/box

mutool vs. muPDF?

https://www.mankier.com/1/mutool

muPDF is just a PDF viewer, while mutool is a command-line application to work with PDFs.

Why are some PDFs non-mouse-selectable? Why are some PDFs selectable but not copyable to the clipboard?

Either the pages are just pictures and not text, or the PDF is configured to forbid copying: "Denied Permissions: copying text".

qpdf --decrypt input.pdf output.pdf

If they're pictures, run the PDF through an OCR tool to add a text layer.

How to edit meta-data (title, etc.)?

Surprisingly, exiftool can also edit PDF meta-data:

exiftool.exe -o output.pdf -Title="My title" -Author="Some author" -Subject="My subject" input.pdf

Resources

Books

Sites