RSSToPdf

TurnToJPG -->


AIM

Using python for fetching back some blog articles and convert them into pdf files, send it to some specified mailbox.

Preparation

The script depends on python library:

  1. feedparser
  2. pdfkit

Install them via:

$ sudo pip install feedparser
$ sudo pip install pdfkit

pdfkit depends on wkhtmltopdf, install it on ubuntu via:

$ sudo apt-get install -y wkhtmltopdf

Configure wkhtmltopdf, because in vps we don’t have X Window:

# apt-get install -y ttf-wqy-zenhei xvfb
# echo 'xvfb-run --server-args="-screen 0, 1024x768x24" /usr/bin/wkhtmltopdf $*' > /usr/bin/wkhtmltopdf.sh
# chmod 777 /usr/bin/wkhtmltopdf.sh 
# ln -s /usr/bin/wkhtmltopdf.sh /usr/local/bin/wkhtmltopdf
# which wkhtmltopdf
/usr/local/bin/wkhtmltopdf

Script

The python script is listed as following:

import sys
import pdfkit
import feedparser
reload(sys);
sys.setdefaultencoding("utf8")

options = {
    'page-size': 'Letter',
    'margin-top': '0.75in',
    'margin-right': '0.75in',
    'margin-bottom': '0.75in',
    'margin-left': '0.75in',
    'encoding': "UTF-8",
    'custom-header' : [
        ('Accept-Encoding', 'gzip')
    ],
    'cookie': [
        ('cookie-name1', 'cookie-value1'),
        ('cookie-name2', 'cookie-value2'),
    ],
    'no-outline': None
}

htmlhead = """
<html>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <head>
            <title>BlogList</title>
    </head>
<body>
"""

htmltail = """
</body>
</html>
"""

Article = htmlhead


d = feedparser.parse('https://feeds.feedburner.com/letscorp/aDmw')

for post in d.entries:
    # Post Title
    #print post.title
    Article += "<h2>" + post.title + "</h2>"
    # Post Content
    #print post.content[0].value.rsplit('span', 2)[0][:-4]
    #Article += post.content[0].value.rsplit('span', 2)[0][:-4]
    Article += post.content[0].value

Article += htmltail
print Article
#pdfkit.from_string(Article, 'output.pdf', options=options)

Unfortunately the last line won’t work, cause we are working under terminal, we use the wrapped wkhtmltopdf, so we comment it, and redirect our output into a html file, manually convert from html to pdf.

Usage

Output pdf via following command:

$ python fetch.py>fetch.html
$ wkhtmltopdf fetch.html output.pdf

The generated output.pdf contains the latest 10 articles in blog, like following screenshot image:

/images/2016_12_29_17_02_52_885x636.jpg

Further Works

  1. Could it fetch more blog items via rss?
  2. Crontab for sending out pdf as attached files to specified email box?
  3. Judgement from date?
  4. Less size of pdf file(shriking the image size)?
  5. Use CSS for beautify this output pdf?