• 2009-12-10 12:00:00

    Faster rsync and emerge in Gentoo

    Scenario

    Recently I started setting up a cluster of 7 Gentoo boxes for a project I am working on. The problem with boxes that come straight out of a hosting company's setup process is that they do not contain the packages you need. You therefore have to set up your USE flags and emerge the packages each box requires for its role.

    I have implemented the following procedure many times on my local networks (since I have more than one Gentoo box) and have also implemented the same process at work (we run 3 Gentoo boxes).

    The way to speed up rsync and emerge is to run a local rsync mirror and to use http-replicator. This will not make the packages compile faster, but it will reduce your network's resource usage (downloads in particular), since each package is downloaded only once, and it will cut the time you wait for each package to download. The same applies to rsync.

    As I said, my network has 7 boxes: 5 are going to be used as web servers, so they effectively share the same USE flags, and 2 as database servers. For the purposes of this tutorial I will name the web servers ws1, ws2, ws3, ws4, ws5 and the database servers db1, db2. The ws1 box will be used as the local rsync mirror and will run http-replicator.

    I am going to set up the /etc/hosts file on each machine so that every box resolves the local hostnames itself and no DNS lookups are required. So for my network I have:

    10.13.18.101  ws1
    10.13.18.102  ws2
    10.13.18.103  ws3
    10.13.18.104  ws4
    10.13.18.105  ws5
    10.13.18.201  db1
    10.13.18.202  db2
    

    Modify the above to your specific setup needs.
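
    Since the same entries go on every box, it helps if adding them is safe to repeat. Here is a minimal sketch, assuming a POSIX shell; add_host is a hypothetical helper name of my own, not something shipped with Gentoo:

    ```shell
    # add_host FILE IP NAME
    # Append "IP  NAME" to FILE unless NAME already appears on a
    # non-comment line, so running it twice does not duplicate entries.
    add_host() {
        file=$1
        ip=$2
        name=$3
        if ! grep -Eq "^[^#]*[[:space:]]${name}([[:space:]]|\$)" "$file" 2>/dev/null; then
            printf '%s  %s\n' "$ip" "$name" >> "$file"
        fi
    }

    # Example (as root):
    #   add_host /etc/hosts 10.13.18.101 ws1
    #   add_host /etc/hosts 10.13.18.201 db1
    ```

    The check only looks at the hostname, so an existing entry with a different IP address is left untouched rather than overwritten.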

    Setting up a local rsync

    Server setup (ws1)

    A really good tutorial can be found in the Gentoo Documentation, but here is the short version:

    The ws1 box already has the rsync package installed. All I need to do is start the daemon. Some configuration is necessary before I start the service:

    nano -w /etc/rsyncd.conf
    

    and what I should have in there is:

    # Restrict the number of connections
    max connections = 5
    # Important!! Always use chroot
    use chroot = yes
    # Clients get read-only access (no uploads)
    read only = yes
    # The user has no privileges
    uid = nobody
    gid = nobody
    # Recommended: Restrict via IP (subnets or just IP addresses)
    hosts allow = 10.13.18.0/24
    # Everyone else denied
    hosts deny  = *
    
    # The local portage
    [niden-gentoo-portage]
    path = /usr/portage
    comment = niden.net Gentoo Portage tree
    exclude = /distfiles /packages
    

    That's it. Now I add the service to the default runlevel and start it:

    rc-update add rsyncd default
    /etc/init.d/rsyncd start
    

    NOTE: If you have a firewall using iptables, you will need to add the following rule:

    # RSYNC
    -A INPUT --protocol tcp --source 10.13.18.0/24 --match state --state NEW --destination-port 873 --jump ACCEPT
    
    Client setup

    On my clients I need to edit the /etc/make.conf file and change the SYNC directive to:

    SYNC="rsync://ws1/niden-gentoo-portage"
    

    or I can use the IP address:

    SYNC="rsync://10.13.18.101/niden-gentoo-portage"
    

    Note that the path used in the SYNC directive is the name of the section (module) I defined in the rsyncd.conf file (niden-gentoo-portage in my setup). This name can be anything you like.
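
    Before syncing, a client can confirm that the module name is spelled the same on both sides. Giving rsync a host with no module name (rsync ws1::) prints the modules the daemon exports, one per line. The sketch below filters that listing; has_module is a hypothetical helper name, and it assumes the module name is the first whitespace-separated field on each line:

    ```shell
    # has_module NAME
    # Read an rsync module listing on stdin and succeed only if NAME
    # is the first field of some line.
    has_module() {
        awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
    }

    # Example (on a client):
    #   rsync ws1:: | has_module niden-gentoo-portage && echo "module is exported"
    ```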

    Testing

    I have already run

    emerge --sync
    

    on the ws1 box, so all I need to do now is run it on my clients. Once I run it I can see the following (at the top of the listing):

    emerge --sync
    >>> Starting rsync with rsync://10.13.18.101/niden-gentoo-portage...
    receiving incremental file list
    ......
    

    So everything works as expected.

    Setting up http-replicator

    http-replicator is a proxy server. When a machine (the local or a remote) requests a package, http-replicator checks its cache and if the file is there, it passes it to the requesting machine. If the file doesn't exist though, http-replicator downloads it from a mirror and then passes it to the requesting machine. The file is then kept in http-replicator's cache for future requests. This way I save on resources by downloading once and serving many times locally.

    Although this might not seem like a 'pure speedup', it will make your installations and updates faster since the download factor will be reduced to a bare minimum. Waiting for packages like mysql, Gnome or others to be downloaded does take a long time. Multiply that time by the number of machines you have on your network and you can see the benefits of having a setup like this.

    Server setup (ws1)

    First of all I need to emerge the package:

    emerge http-replicator
    

    Once everything is done I need to change the configuration file to suit my needs:

    nano -w /etc/conf.d/http-replicator
    

    and the file should have:

    GENERAL_OPTS="--dir /var/cache/http-replicator"
    GENERAL_OPTS="$GENERAL_OPTS --user portage"
    DAEMON_OPTS="$GENERAL_OPTS"
    DAEMON_OPTS="$DAEMON_OPTS --alias /usr/portage/packages/All:All"
    DAEMON_OPTS="$DAEMON_OPTS --log /var/log/http-replicator.log"
    DAEMON_OPTS="$DAEMON_OPTS --ip 10.13.18.*"
    ## The proxy port on which the server listens for http requests:
    DAEMON_OPTS="$DAEMON_OPTS --port 8080"
    

    The last line with the --port parameter specifies the port that http-replicator will listen on. You can change it to whatever you want. The --ip parameter restricts who is allowed to connect to this proxy server. I have allowed my whole internal network; change it to suit your needs. Lastly, the --dir option is where the cached data is stored. You can change it to whatever you like; I have left it at its default. Therefore I need to create that folder:

    mkdir /var/cache/http-replicator
    

    Since I have specified that the proxy runs as the portage user (see the --user option above), I need to change the owner of my cache folder:

    chown portage:portage /var/cache/http-replicator
    

    I add the service to the default runlevel and start it:

    rc-update add http-replicator default
    /etc/init.d/http-replicator start
    

    NOTE: If you have a firewall using iptables, you will need to add the following rule:

    # HTTP-REPLICATOR
    -A INPUT --protocol tcp --source 10.13.18.0/24 --match state --state NEW --destination-port 8080 --jump ACCEPT
    

    You will also need to run the following regularly:

    repcacheman
    

    and

    rm -rf /usr/portage/distfiles/*
    

    to clear the distfiles folder. I have put both commands in a bash script that cron runs every night.
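
    A script along those lines could look like the sketch below. It is not my exact script: the function name nightly_cleanup and the two override variables are assumptions added for illustration, and the guards are there so that rm never runs against an empty or missing path.

    ```shell
    # Paths and command are overridable; the defaults match the text above.
    DISTDIR=${DISTDIR:-/usr/portage/distfiles}
    REPCACHEMAN=${REPCACHEMAN:-repcacheman}

    nightly_cleanup() {
        # Bail out if DISTDIR is empty or not a directory.
        [ -n "$DISTDIR" ] && [ -d "$DISTDIR" ] || return 1
        # Run repcacheman first, and stop if it fails.
        "$REPCACHEMAN" || return 1
        # Then clear the distfiles folder.
        rm -rf -- "${DISTDIR:?}"/*
    }

    # Example: save this as a script ending with a call to nightly_cleanup,
    # then add a crontab line such as (hypothetical path and time):
    #   0 3 * * * /usr/local/sbin/distfiles-cleanup.sh
    ```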

    Client setup

    On my clients I need to edit /etc/make.conf and add the following:

    http_proxy="http://ws1:8080"
    RESUMECOMMAND=" /usr/bin/wget -t 5 --passive-ftp  \${URI} -O \${DISTDIR}/\${FILE}"
    

    I have commented out any previous RESUMECOMMAND statements.
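
    With six clients to edit, the comment-out-and-append step can be scripted. This is a minimal sketch, assuming GNU sed (for -i) and assignments that fit on a single line; set_make_conf_var is a hypothetical helper name:

    ```shell
    # set_make_conf_var FILE NAME VALUE
    # Comment out every active NAME=... line in FILE, then append
    # NAME="VALUE" at the end of the file.
    set_make_conf_var() {
        file=$1
        name=$2
        value=$3
        sed -i "s/^[[:space:]]*${name}=/#&/" "$file"
        printf '%s="%s"\n' "$name" "$value" >> "$file"
    }

    # Example (as root, on each client):
    #   set_make_conf_var /etc/make.conf http_proxy "http://ws1:8080"
    ```

    Running it a second time comments out the line added by the first run and appends a fresh one, so the file grows but the active value stays correct.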

    Testing

    Testing begins on one of the clients (you can choose any package):

    emerge logrotate
    

    and I can see in the output that the download goes through the proxy on ws1:

    ws2 ~ # emerge logrotate
    Calculating dependencies... done!
    
    >>> Verifying ebuild manifests
    
    >>> Emerging (1 of 1) app-admin/logrotate-3.7.8
    >>> Downloading 'http://distfiles.gentoo.org/distfiles/logrotate-3.7.8.tar.gz'
    --2009-12-10 06:46:47--  http://distfiles.gentoo.org/distfiles/logrotate-3.7.8.tar.gz
    Resolving ws1... 10.13.18.101
    Connecting to ws1|10.13.18.101|:8080... connected.
    Proxy request sent, awaiting response... 200 OK
    Length: 43246 (42K)
    Saving to: `/usr/portage/distfiles/logrotate-3.7.8.tar.gz'
    
    100%[=============================>] 43,246      --.-K/s   in 0s
    
    2009-12-10 06:46:47 (89.6 MB/s) - `/usr/portage/distfiles/logrotate-3.7.8.tar.gz' saved [43246/43246]
    .....
    

    Final thoughts

    Setting up local proxies allows your network to be as efficient as possible. Not only does it reduce the download time for your updates, it is also courteous to the Gentoo community. Since mirrors are run by volunteers or non-profit organizations, it is only fair not to abuse their resources by downloading an update more than once for your network.

    I hope this quick guide will help you and your network :)