Journal for the Camarilla Servers
 

Below are 12 journal entries, after skipping by the 20 most recent ones recorded in cam_servers' LiveJournal:

    Tuesday, December 7th, 2004
    7:59 am
    as soon as...
    the system is reset, it will be. I'm waiting (called in on Friday) just like everyone else... sorry.

    In the meantime, cammail seems to be ready to take the lists without any problems. Transferring the 11GB of archives/lists over is all we're waiting on for that, and that in turn is waiting for the system we're moving from to come back up.
    Wednesday, December 1st, 2004
    1:10 pm
    sunday night, now
    apparently there was a power surge Sunday night (or was that Monday? I worked 15 hours yesterday and 12 on Monday, so my timing might be skewed from that) that took out both servers.

    I got on cammail last night to start moving stuff over from camarilla, and found that...it wasn't there. Rather, it was, but it wasn't responding to actual requests for info.

    Assuming it wakes up at some point today, though, several OOC lists will be moving over to cammail tonight. An additional large chunk will move over tomorrow. The entirety of the OOC lists will be moved by the week's end. IC lists are another matter, especially those that are for the old chronicle.

    Sorry for the delay in putting up this note - I've been with vendors all day down in the server room, far from any system that I could update LJ from, and I simply didn't think to do it last night.
    Tuesday, November 9th, 2004
    11:21 am
    mail issues, explained
    There seems to be continued confusion about the mail issue. Let me explain it in a way that will hopefully be understood.

    When I post an email to the ordo dracul list, camarilla.white-wolf.com checks the MX record for lmco.com. It then says HELO (a mail server's command to introduce itself) to mailgwX.lmco.com (replace X with the actual number it goes to) and gives its name. The IP is then gleaned from the connection, and mailgwX.lmco.com decides to find out who 206.65.59.149 is.

    It asks the root server(s). This is what it gets in reply:

    Asking a.root-servers.net for 149.59.65.206.in-addr.arpa PTR record:
    a.root-servers.net says to go to indigo.arin.net. (zone: 206.in-addr.arpa.)
    Asking indigo.arin.net. for 149.59.65.206.in-addr.arpa PTR record:
    indigo.arin.net [192.31.80.32] says to go to auth00.ns.uu.net. (zone: 65.206.in-addr.arpa.)
    Asking auth00.ns.uu.net. for 149.59.65.206.in-addr.arpa PTR record:
    auth00.ns.uu.net [198.6.1.65] says to go to Z1.NS.SJC1.GLOBIX.NET. (zone: 59.65.206.in-addr.arpa.)
    Asking Z1.NS.SJC1.GLOBIX.NET. for 149.59.65.206.in-addr.arpa PTR record: Reports camarilla.white-wolf.com. [from 209.10.34.55]

    Answer:
    206.65.59.149 PTR record: camarilla.white-wolf.com. [TTL 86400s] [A=206.65.59.149]
    206.65.59.149 PTR record: www.camarilla.org. [TTL 86400s] [A=None] *ERROR* A record does not point back to original IP.


    At least, that's what we *hope* it gets, and exactly 50% of the time that is what happens. The other 50% of the time, the last two lines (the answer) switch - the answer is www.camarilla.org.

    So when someone does a reverse lookup, they get back that answer - but they only look at the first response. Since www.camarilla.org is the first response 50% of the time, that mail gets rejected.
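    As a rough illustration of that first-answer check (a toy simulation, not the code any real mail gateway runs), the sketch below uses a hypothetical in-memory table in place of DNS. The table contents come from the lookup output above; the function names and the random ordering, a simplification of round-robin rotation, are assumptions made for the example.

```python
import random

# Hypothetical in-memory stand-in for DNS, built from the lookup output above.
PTR_RECORDS = {
    "206.65.59.149": ["camarilla.white-wolf.com.", "www.camarilla.org."],
}
A_RECORDS = {
    "camarilla.white-wolf.com.": "206.65.59.149",
    # www.camarilla.org. has no A record pointing back to this IP.
}


def reverse_lookup(ip, rng):
    """Return the PTR names in random order (simplified round-robin)."""
    names = list(PTR_RECORDS.get(ip, []))
    rng.shuffle(names)
    return names


def accepts_mail(ip, rng):
    """Accept only if the FIRST PTR name resolves forward to the same IP."""
    names = reverse_lookup(ip, rng)
    if not names:
        return False
    return A_RECORDS.get(names[0]) == ip


def acceptance_rate(ip, trials=10_000, seed=0):
    """Fraction of simulated delivery attempts that pass the check."""
    rng = random.Random(seed)
    accepted = sum(accepts_mail(ip, rng) for _ in range(trials))
    return accepted / trials


if __name__ == "__main__":
    print(acceptance_rate("206.65.59.149"))
```

    With two PTR records and only one of them valid, the simulated acceptance rate hovers around one half, which matches the behaviour described above.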

    What happens then? The email is resent. And resent again. And resent again. And again. And again.

    What can be done about this? 3 things:
    1. simply change 206.65.59.149 to a different ip, and point camarilla.white-wolf.com to that ip instead. WW issue.
    2. remove the damn www.camarilla.org record. Earthlink (the authority on the name, apparently) refuses to do this. WW has lawyers talking to them, or so I am told.
    3. Start sending mail from cammail.white-wolf.com

    Which are we planning on doing? #3, currently. Which will mean that in a couple weeks, all email will suddenly be coming from somewhere else! This will take a bit of work in mailman to change everything over, and will break filters people have that explicitly name servers.

    So when an email doesn't arrive right now, this is the reason. The emails are actually zooming through quite nicely with zero delays - email sent to a list generally returns in seconds (what your MX does with it then is their ball). The servers aren't bogged down; there is simply a delay in retrying a failed attempt at sending a particular email. At 50% a shot, that could -potentially- be a long delay.
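    To put rough numbers on "potentially", here is a small sketch that assumes every retry is an independent 50% shot. That independence is an assumption for the arithmetic, and nothing here models how long the mail server waits between retries.

```python
def chance_still_undelivered(attempts, p_fail=0.5):
    """Probability that every one of `attempts` independent tries fails."""
    return p_fail ** attempts


def expected_attempts(p_fail=0.5):
    """Mean number of tries until the first success (geometric distribution)."""
    return 1 / (1 - p_fail)


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, chance_still_undelivered(n))
    print("average tries needed:", expected_attempts())
```

    Under that assumption a message needs two tries on average, and roughly 3% of messages would still be waiting after five tries in a row.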

    Let me know if this doesn't make sense. No, not all mail servers do reverse lookups - those that don't are getting all mail instantly. It's those of us that have mail servers run by anti-spam admins that have the issue (since the reverse lookup is done for anti-spam reasons).

    And yes - that means that we often look like a spam box to many servers out there...what with the volume of mail (sent to lists, not specific people), and the failed lookups.

    Friday, October 29th, 2004
    4:07 pm
    updates saturday morning!
    Saturday morning, approx 5 minutes after I wake up (giving me time to eat some oatmeal or something), I will be updating apache and postfix on the servers. I am planning on this being ~7am EST, and it won't be started later than 8:30am EST. There will actually be less than a minute of downtime with apache, and any downtime with postfix won't be noticeable since mail will spool for the few minutes it is down (ie - no service impact).

    Current Mood: geeky
    Current Music: Covenant - Europa
    Thursday, October 28th, 2004
    3:26 am
    yes, we know
    sorry for the delay in verifying that we know, but yes...we do in fact know.

    Both servers are gone, so failing over to the second one just won't be happening. It also limits the number of things that can be reasonably expected to be the cause (since cam2 wasn't even being used this evening).

    Will have to wait until I can reach folks at WW in the morning...until then, relax and enjoy the moment of silence.
    Thursday, October 7th, 2004
    10:01 am
    updates, downtime
    made a number of updates last night, including getting past one of the main issues that had been giving me problems (or, is that problems that had been giving me issues?)

    a new kernel, with things I've been wanting (I feel like it's Christmas!) is compiled, and ready to go into place. We're currently on 2.4, and will finally be moving to 2.6. Fun times to be had.

    While there is a process by which to change a running kernel... with far less energy, I can type "reboot" and just give people warning beforehand. This is that warning - tomorrow morning, at ~7, I will be rebooting the system with the shiny happy new kernel. I am doing it then so that if something goes wrong, WW can help me with the "no physical access" problem I face.

    I'm the Cam Servers, and I endorse this message.
    Wednesday, September 22nd, 2004
    11:07 pm
    undelivered email
    Greetings. Some of you may have noticed from time to time that mail simply does not arrive at your inbox.

    I'm not talking about the short times when no one is getting mail, ie mail that is eventually delivered, I'm talking about email you don't get ever from the cam server.

    Well, let me explain why this is happening. In kinder, gentler times, the internet was formed with nary a thought that something as vile as spam could some day account for the vast majority of email traffic, or that porn and illegal file transfers would push the backbone bandwidth further and further.

    In this world, DNS, email agents (like sendmail), and other backbone services were formed.

    In modern times, spam does exist, and one of the ways ISPs combat spam is to do something called a "reverse lookup" on someone sending an email. A DNS lookup entails looking up the IP for a hostname (you have camarilla.white-wolf.com, for instance, and want to know how to get there - ie, the IP address). A reverse lookup, on the other hand, is when someone takes your IP and looks up the hostname. It is one of the many ways ISPs can verify that someone sending them an email is not a spam server - if the IP they're coming from does resolve back to the hostname they're claiming, then they accept the email.
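    For the curious, a reverse lookup is still an ordinary DNS query; the IP is just written backwards under the in-addr.arpa zone and a PTR record is requested for that name. A minimal sketch using Python's standard ipaddress module (the helper name ptr_query_name is made up for this example):

```python
import ipaddress


def ptr_query_name(ip):
    """Return the in-addr.arpa name that a reverse lookup asks DNS about."""
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    print(ptr_query_name("206.65.59.149"))  # 149.59.65.206.in-addr.arpa
```

    This only builds the name being queried and never touches the network. Performing the lookup itself would be a job for something like socket.gethostbyaddr, whose answer depends on whatever DNS says at the time.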

    http://www.dnsstuff.com/tools/ptr.ch?ip=206.65.59.149

    Refresh that page a few times, and you'll see that half the time, www.camarilla.org comes up before camarilla.white-wolf.com. As the page says, most people who do reverse lookups only check the first entry. Since the first entry listed is www.camarilla.org half the time, and since that entry is broken anyway, half the attempts to send email to ISPs that do this reverse lookup test fail. Since camarilla.org is still wrapped up in legal-land, it's been a little bit of an issue.

    Does that mean that the email is gone forever? No; if it did, far more email would be lost than is. What it means is that the email is sent again later. Again, it has a 50% chance. If it fails again, it tries again... I don't recall off-hand how many times postfix will try, but it is more than once. That, and not all ISPs are doing this (only about a quarter or fewer, really). The numbers do add up to something significant, though.

    This issue is now potentially resolved, though... it takes a while for DNS updates to propagate through the whole of the internet, but in theory by this time tomorrow camarilla.white-wolf.com will be the only record for that address. Then, fewer emails will be lost, and more importantly the servers themselves will have less work to do (since they won't have to keep resending emails).

    Just an FYI, and a note letting all know it's being resolved.

    NOTE: this does not serve as an excuse to claim you sent something, but that the server "ate" it. Nothing of that sort has happened. /All/ this means is that outbound mail sometimes doesn't arrive. That email will still be in the archives, and any posts sent TO the cam servers DO get to the cam servers. It only works one way, not both.

    Current Music: Apoptygma Berzerk - Welcome to Earth
    Sunday, September 19th, 2004
    7:29 am
    update
    Mysqldump (the process for backing up the database, so it can be recreated somewhere else, or recovered) had errors yesterday morning that I had to deal with that took me outside of the acceptable downtime window. As such, I'm going to try to do the move this morning as quickly as I can; the web pages will be down for a few minutes, and mailman won't be sending anything for about an hour (the emails will still be there).
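    For readers who haven't met mysqldump, a dump of this kind might be driven roughly as sketched below. The database name and output path are hypothetical, and the flags shown (real mysqldump options) are an assumption about a reasonable invocation, not the command actually used on the cam servers.

```python
import subprocess


def build_dump_command(database, outfile):
    """Build the mysqldump argument list for one database."""
    return [
        "mysqldump",
        "--single-transaction",        # consistent snapshot for InnoDB tables
        f"--result-file={outfile}",    # write the SQL dump to this file
        database,
    ]


def run_dump(database, outfile):
    """Run the dump; raises CalledProcessError if mysqldump reports an error."""
    subprocess.run(build_dump_command(database, outfile), check=True)


if __name__ == "__main__":
    # Hypothetical names, printed rather than executed.
    print(build_dump_command("camdb", "/tmp/camdb.sql"))
```

    The resulting file is plain SQL, so recreating the database on another machine comes down to feeding that file to the mysql client there.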

    Fortunately, I was instead able to spend time making sure that various performance-related issues were taken care of on the other server, and to get the kernel where I wanted it.
    Thursday, September 16th, 2004
    2:49 pm
    moment of silence - 9/16/2004 14:25-14:40
    That moment of silence brought to you (most likely) by the same 5 letters as the last 2 moments of silence. The cause is known, and will be resolved this weekend. A few broken libraries...but instead of trying to fix/replace them remotely with no ability to interact with the systems in person should something go amiss, I'll be switching cam1 and cam2's natural positions.

    I realize this was suggested for last weekend, but other needs arose. Instead, it will be this Saturday morning, at approx 6am. The downtime will be approx 30 minutes, but could be much less or a little more depending on traffic volume.
    Wednesday, September 15th, 2004
    11:19 pm
    mysql
    mysql just blipped a couple times; what was going to be a seamless transition to InnoDB support burped. InnoDB support is there now though.
    Wednesday, September 8th, 2004
    12:12 pm
    Sept 7 downtime
    some time after 6pm, the lists stopped working. The port was responsive, it was simply not sending anything (ie - standard monitoring tools were useless). The admin logged in sometime between 7 and 8, started an emerge, then went downstairs to paint the ceiling in a room in his newly-converted basement. When he came back, his term to cam1 was locked... and no further sessions were possible. The system was still up, responsive to pings and various things, but was not responsive to GETs from the web server. It also stopped sending mail.

    The reason for this is unknown until the admin can look through the logs, but it is likely related to a c library he (I) had seen complaints about a day before, leading to the discovery of a typo in a key file (make.conf, for gentoo-aware folk). The extended downtime was due to the fact that nothing set off the hangcheck timer, and the two people the admin (me) knew to contact at WW were not there. Fortunately, one did get there a short time ago and save the day.
    11:56 am
    introduction
    Greetings. A bit of background, first.

    There are 2 servers - cam1, and cam2, though they do not resolve with those names (only a third IP, which neither owns, resolves... as "camarilla.white-wolf.com").

    When cam1 and 2 arrived to be built, some of the components had problems. NewEgg (should I get kickbacks for using their name?) quickly replaced those components, and out the door it went.

    Cam1 houses the web portals, and email. Cam2 is available for backups, devel, failover, etc, etc. Theory vs. practice - more on that another time.

    Tonight, now that cam1 is back, cam1 will be becoming cam2, and vice-versa. Cam1 was discovered to have a typo in a very important file 2 days ago, which may have contributed to several of the non-power and network issues. More on this change in the following post.

    The future: cam1 (current cam2) will be the primary web portal, and will run the lists. Cam2 (current cam1) will run several services that have been unannounced so far to the general public (they'll be really nifty, though). Both will monitor the other, and will take over the work of the other if it fails. This will make all services slower, but available, until the failed system returns.
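    The mutual monitoring described above can be pictured with a small heartbeat sketch. This is only an illustration of the general idea; the class, the 30-second timeout, and the service names are all hypothetical and say nothing about how the cam servers are actually wired up.

```python
class PeerMonitor:
    """Tracks a peer's heartbeats and decides when it looks down."""

    def __init__(self, peer_name, timeout_seconds, started_at):
        self.peer_name = peer_name
        self.timeout_seconds = timeout_seconds
        self.last_seen = started_at  # treat startup as the first heartbeat

    def record_heartbeat(self, now):
        """Note that the peer checked in at time `now` (in seconds)."""
        self.last_seen = now

    def peer_is_down(self, now):
        """True once no heartbeat has arrived within the timeout window."""
        return now - self.last_seen > self.timeout_seconds


def services_to_run(own_services, peer_services, monitor, now):
    """Own services always; the peer's as well while the peer looks down."""
    if monitor.peer_is_down(now):
        return own_services + peer_services
    return own_services


if __name__ == "__main__":
    monitor = PeerMonitor("cam2", timeout_seconds=30, started_at=0)
    print(services_to_run(["web", "lists"], ["other-services"], monitor, now=10))
    print(services_to_run(["web", "lists"], ["other-services"], monitor, now=45))
```

    Each box would run a monitor for the other, so whichever one survives ends up carrying both workloads until its peer starts sending heartbeats again.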

    Keep an eye on this journal if you want to know about planned outages, server updates, and post-event reasons for unplanned outages.

    Yours,
    The Tech Team