Getting the images

Now that you know where to get the images, you need to actually get them. The obvious way is to right-click and save them, or to use a download button if the site offers one.

Make sure you get the size you want, or the biggest size available. Downloading a thumbnail for your triple 4K monitors is probably not what you want. It is best to keep separate directories: one for images that have just been downloaded and one for images that have already been sorted.
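
A minimal sketch of such a layout, using the same base directory the scripts below use (the name of the sorted directory is just my suggestion):

mkdir -p $HOME/Pictures/to_work_on/download
mkdir -p $HOME/Pictures/to_work_on/sorted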

Besides manual downloading there is automatic downloading. I will cover two websites: Reddit and Wallhaven.

Reddit

On Reddit there are many subreddits that post images and even wallpapers. As an example I will use EarthPorn. Despite the name, it is not adult content. The first method is to download manually: you sort on New and then start downloading. I use the old layout. I also use the plugin Snap Links, which I have configured so that I can hold the middle mouse button, drag over the links and open them in new tabs. Easy and fast, but not really automated.

#!/bin/bash
# Rip reddit Earthporn

#set -x
# Parameters {{{
RSS_URL=https://www.reddit.com/r/EarthPorn/new/.rss
TMP=$(mktemp)
DIR=$HOME/Pictures/to_work_on/download/earthporn
RC="$HOME/.config/$(basename $0)/reddit.rss"
test -d $DIR || mkdir -p $DIR
# Make sure the config directory exists, otherwise the mv at the end fails
test -d $(dirname $RC) || mkdir -p $(dirname $RC)
test -f $RC && . $RC || PREVIOUS=0
test -z "$PREVIOUS" && PREVIOUS=0

# }}}
# Put the data in two lines {{{
DATA=$( lynx -dump $RSS_URL |\
        sed 's#><#>\n<#g' |\
        sed 's#&quot;#\n#g' |\
        egrep '^XX|updated|jpg' |\
        grep -v media |\
        sed 's#</updated>##g'|
        sed ':a;N;$!ba;s/\n<updated>/ /g' | \
        grep -v '^<' | \
        awk '{print $2, $1}' | \
        sort
        )       
# }}}           
# Get the images {{{
echo -e "$DATA" |
while read TIME URL
do
        DATE=$(date --date "$TIME" +%s)
        if [ "$DATE" -gt "$PREVIOUS" ]
        then
                PREVIOUS=$DATE
                wget -qO $DIR/$DATE.jpg $URL
                echo "PREVIOUS=$PREVIOUS" > $TMP
        fi
done

# Write the last time, but only if something new was downloaded,
# otherwise an empty file would reset PREVIOUS to 0 on the next run
test -s $TMP && mv $TMP $RC || rm -f $TMP
#}}}

So what does it all mean? It is written in Bash for Linux.

The first part defines the RSS link, the directory where the images should go and the place where a little bit of state is stored. It also does a few tests and sets the parameters.

The second part makes a list from the URL. If you set RSS_URL=https://www.reddit.com/r/EarthPorn/new/.rss in a terminal and then run the command from lynx up to sort, you will see a list of jpg URLs. It is extremely ugly code, but it nicely turns the RSS page into a list with times and images. This could be done much better with an RSS parser. Do so if you want to.

The next part looks at each line and checks whether the time is newer than the previous time (which is saved at the end). If it is newer, the image is downloaded and the saved time is updated. If it is older, it has already been downloaded and nothing happens.
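
The comparison works because date +%s turns the RSS timestamp into seconds since the epoch, so newer simply means a bigger number. For example (the timestamps are only illustrations):

date --date "2024-05-01T10:00:00+00:00" +%s   # 1714557600
date --date "2024-05-01T11:00:00+00:00" +%s   # 1714561200, larger, so newer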

At the end the new time is moved to wherever $RC is read from in the beginning. I run this every hour using crontab. This can easily be adapted to any subreddit you desire. You must edit the DATA part: look at the source of the RSS file with lynx -dump https://..../.rss and use a lot of sed, grep, awk and magic to get the list. Do not forget to end with the |sort, as this is important later.
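
The hourly run is a single crontab entry (edit it with crontab -e); the path is wherever you saved the script, here an assumed ~/bin/rip_earthporn:

# minute hour day month weekday command
0 * * * * $HOME/bin/rip_earthporn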

Some subreddits will have links to other websites instead of direct images. You will then need to add more trickery: decide which websites are used most and write a function for each of them that gets the actual image. This will differ from website to website.
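
A rough sketch of how such a dispatch could look; the domains and the get_imgur_image helper are only placeholders, you still have to write the actual scraping for each site yourself:

# $DIR is the download directory from the script above
get_image() {
        case "$1" in
                *i.redd.it/*|*.jpg|*.png)
                        # Direct link to an image, just fetch it
                        wget -q -P $DIR "$1"
                        ;;
                *imgur.com/*)
                        # Hypothetical helper that digs the real image out of the page
                        get_imgur_image "$1"
                        ;;
                *)
                        echo "No handler for $1" >&2
                        ;;
        esac
}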

Wallhaven.cc

For Wallhaven.cc we are going to use their API. The explanation can be found in the API documentation on their website. So here is the script:

#!/bin/bash
# Rip Wallhaven
#set -x

#Fixed parameters {{{
CFGDIR="$HOME/.config/ripping"
PURETY=110
FILE="$CFGDIR/wallhaven.cc"
DIR="/home/houghi/Pictures/to_work_on/download/Wallhaven"
CATS=101
API="InsertYourApiHere"
test -d $CFGDIR || mkdir -p $CFGDIR
# }}}
#{{{ The script
URL="https://wallhaven.cc/api/v1/search?purity=$PURETY&categories=$CATS&apikey=$API"
test -d $DIR || mkdir -p $DIR
        
TMP=$FILE.tmp
test -f $FILE || touch $FILE
# Pull the JSON from the API and keep only the "path" URLs
lynx -source "$URL" \
        | sed 's/,/,\n/g' \
        | sed 's#\\##g' \
        | grep path \
        | awk -F '"' '{print $4}' > $TMP

# Download only the links that were not in the previous run's list
for LINK in $(grep -Fxv -f $FILE $TMP)
do 
        #echo $LINK
        BASE=$(basename $LINK)
        wget -q -O $DIR/$BASE $LINK
done

# Remove failed (zero byte) downloads
find $DIR -empty -type f -delete
mv $TMP $FILE
# }}}

exit

In the first part all the parameters are set. Change PURETY and CATS as you desire; both are three binary flags, purity standing for sfw/sketchy/nsfw and categories for general/anime/people, so 110 and 101 mean sfw plus sketchy and general plus people. Edit the API key or leave that part out of the URL. The second part first makes a nice list of all the links on that first page and writes it away.
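
If you want to see what the API returns before letting the script loose, you can run the first half by hand (replace the key; the | head just keeps the output short):

API="InsertYourApiHere"
lynx -source "https://wallhaven.cc/api/v1/search?purity=110&categories=101&apikey=$API" \
        | sed 's/,/,\n/g' \
        | sed 's#\\##g' \
        | grep path \
        | head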

Then it downloads everything it found, except the links that are already in the list from the previous run. Finally it saves the new list as the old list for the next run. That way you do not download anything twice.
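
The "except the links that are already in the list" part is done by grep: -f reads the old list as patterns, -F and -x treat them as complete fixed lines and -v inverts the match, so only new links survive. A small illustration with made-up files:

printf 'a.jpg\nb.jpg\n'        > old.txt
printf 'a.jpg\nb.jpg\nc.jpg\n' > new.txt
grep -Fxv -f old.txt new.txt   # prints only c.jpg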

This one I run every 20 minutes with crontab. I am well aware that there are better ways to parse JSON; this works for me.
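
If jq is installed, for instance, the whole lynx/sed/grep/awk part can probably be replaced by a single filter; this is an untested sketch that assumes the API keeps returning the image URLs in a data array under the key path:

lynx -source "$URL" | jq -r '.data[].path' > $TMP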

Other websites

With these two it was pretty easy: the first has RSS and the second an API. What if you have a website that has neither? Here is what the process looks like. The details will differ for each website. You can also search first to see whether a ripper already exists.

First you need the start URL of the first page. For example, on Wallhaven.cc you go to Latest, make your selection and look at the URL. You get something like https://wallhaven.cc/search?categories=110&purity=100&sorting=date_added&order=desc&page=1 and that is where you start from. You then go to the first image: click on it, then right-click to view the image itself and look at the URL: https://w.wallhaven.cc/full/lm/wallhaven-lmpldy.jpg

Look at that filename. lmpldy is not something standard, so we remember it and do lynx -source "https://wallhaven.cc/search?categories=110&purity=100&sorting=date_added&order=desc&page=1" | sed 's/ /\n/g' | grep lmpldy to see what turns up. In this case we get several lines, and one of them starts with data-src. Mmmm. Let us replace the lmpldy in the grep with data-src and see what we get.

Now we have a list, and those links look pretty much like the link of the image. So we need sed, awk, perl and other magic to turn the one link into the other. To go from data-src="https://th.wallhaven.cc/small/lm/lmpldy.jpg" to https://w.wallhaven.cc/full/lm/wallhaven-lmpldy.jpg we only need everything between the quotes, change th to w, change small to full and add wallhaven- before the name.
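
A sketch of that transformation, assuming the page still uses data-src attributes and jpg thumbnails as in the example above:

lynx -source "https://wallhaven.cc/search?categories=110&purity=100&sorting=date_added&order=desc&page=1" \
        | sed 's/ /\n/g' \
        | grep '^data-src' \
        | awk -F '"' '{print $2}' \
        | sed 's#//th\.#//w.#; s#/small/#/full/#; s#/\([^/]*\.jpg\)$#/wallhaven-\1#'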

Test whether this works for the other images. If it does, you have a list. Write it to a file and adapt the script above so you download each image only once.

So now you have one or more directories filled with images. The next step is to sort them.