Thursday, February 26, 2015

FreeBSD From the Trenches: ZFS, and How to Make a Foot Cannon

This month's story comes to us from Glen Barber, UNIX Systems Administrator.

The ZFS filesystem is well regarded for its robustness and extensive feature set.

Its robustness can be haunting, however, if a mistake is made.  I learned this the hard way through a seemingly innocent typo, a mistake I certainly will not soon repeat.

We use ZFS almost exclusively in the FreeBSD cluster.  I say "almost" because there is one remaining machine that does not use ZFS, because the machine is too underpowered to handle it.

All machines are installed in a netboot environment while logged in at the serial console, providing the utilities necessary for extremely customizable installations.  Most of the installations I have performed on machines in the FreeBSD.org cluster have been pseudo-scripted, with subtle differences depending on the machine, such as whether the disks are da(4) or ada(4), the number of disks, how much space to allocate for swap, the number of ZFS pools, and so on.

For the most part, a basic installation would be done with a very simple sh(1) script that looks something like:

# for i in $(sysctl -n kern.disks); do \
  gpart create -s gpt $i; [...]; done
Nothing too fancy at all.

Most times I would copy/paste from an installation script I've used for years; other times I would manually type the commands.  It really depended on what the end result was supposed to be, as far as configuration.

When I installed the FreeBSD Foundation's new server, I typed the commands manually.  You might ask, "Why did you do it this way?"  To this day, I cannot answer that question.  But if I hadn't, this story would be far less interesting.

The machine was installed like this, almost verbatim:
# for i in $(sysctl -n kern.disks); do \
  gpart create -s gpt /dev/${i}; \
  gpart add -t freebsd-boot -s 512k -i 1 /dev/${i}; \
  gpart bootcode -b /boot/pmbr \
  -p /boot/gptzfsboot -i 1 /dev/${i}; \
  gpart add -t freebsd-swap -s 16G -i 2 /dev/${i}; \
  gpart add -t freebsd-zfs -i 3 /dev/${i}; \
  done
# zpool create zroot mirror /dev/ada0 /dev/ada1
# for i in tmp var var/tmp var/log \
  var/db usr usr/local usr/home; do \
  zfs create -o atime=off zroot/${i}; \
  done
This creates the GPT partition scheme for all available hard disks, writes the partition layout to the disks, writes the GPT boot code to the first partition on each disk, and allocates the swap space and ZFS space.  Then it creates the ZFS pool named 'zroot' configured as a mirror, and creates the ZFS datasets in the new pool.

The problem is not too obvious unless you are looking for it specifically, but instead of using the 'freebsd-zfs' GPT partitions, which are /dev/ada0p3 and /dev/ada1p3, I created the pool on the full disk (/dev/ada0 and /dev/ada1).
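For comparison, here is a minimal sketch of how the intended command could be assembled from the disk names, so that the 'p3' suffix is never typed by hand.  It assumes the partition layout shown above (freebsd-zfs at index 3); the helper name zfs_providers is hypothetical, and the script only prints the command rather than running it.

```sh
#!/bin/sh
# Sketch only: print, rather than run, a 'zpool create' command that uses
# the freebsd-zfs partition (index 3 in the layout above) on each disk.
# zfs_providers is a hypothetical helper name, not a FreeBSD utility.

zfs_providers() {
    # Append "p3" to every disk name given on the command line.
    providers=""
    for disk in "$@"; do
        providers="${providers}${providers:+ }/dev/${disk}p3"
    done
    printf '%s\n' "$providers"
}

# Review the printed command before running it by hand.
echo "zpool create zroot mirror $(zfs_providers ada0 ada1)"
```

Printing the command first gives one more chance to notice a whole-disk device name before anything touches the disks.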

Simple enough to fix, right?  Destroy the 'zroot' pool, destroy the GPT partition layout to be safe, and create it again with the correct arguments to 'zpool create'.

So, that's what I did.

Luckily I wasn't ready to put this machine into production yet.  I still wanted to do some basic stress testing on the machine before moving anything critical to it.

Fast forward about a month.

After being satisfied that the machine did not have any obvious stability problems, such as faulty RAM for example, and after having lowered the relevant TTL entries in DNS, I decided to do one more upgrade on the machine before beginning the independent service migrations to the new machine.

This is where things started to go wrong.  Fast.

The source-based upgrade finished, and I rebooted the machine.  In another terminal, attached to the serial console, I saw the machine proceed through the normal reboot routines, killing running services, syncing buffers, and so on.

After the machine completed POST routines, everything went dark.  The machine did not respond to serial console input, and as far as I could tell, this was not due to a change caused by the update.

I should note that, by nature, I am a paranoid sysadmin.  This is a good quality, in my opinion, because I habitually go out of my way to make sure any situation is recoverable if something goes wrong.  Suspecting I did something wrong, I immediately began reviewing the history recorded while logged in at the console.  Nothing looked suspicious.  This upgrade should have "just worked."

I remotely power-cycled the machine, and booted into our netboot environment to investigate further.

I immediately knew something had gone wrong after importing the 'zroot' pool into a temporary location, and seeing several tell-tale signs.  For starters, /etc/rc.conf had a timestamp that predated the machine even being shipped to the colocation facility.  More confusingly, /usr/obj was empty, as if the 'buildworld/buildkernel'-style upgrade that took place less than an hour prior had never happened.

Then panic ensued.  The machine didn't panic -- I did.

Everything was gone.

Every configuration change since the initial install, every jail that was created, every package that was installed.  All of it.  Just gone.

While investigating, I sent a heads-up to the other cluster administrators in case there was an issue that affected other installations.  As investigation progressed, Peter realized he had seen this exact behavior in the past, and provided an example scenario with which it could occur.

It was exactly what I had done - used the raw disk for the ZFS pool instead of the 'freebsd-zfs' GPT partition.

So, what's the problem?

The problem is 'zpool destroy' does not implicitly delete pool metadata from the disks, so as far as ZFS is concerned, I had two different ZFS pools, both named 'zroot', which confused the boot blocks just enough to import the wrong pool at boot.  Well, it didn't just import the wrong pool, it imported an empty pool.

Worse yet, because I had allocated the partitions in the order of 'freebsd-boot', 'freebsd-swap', and 'freebsd-zfs', and because 'freebsd-swap' was 16GB, the swap partition had more than enough space to hold on to the metadata from the pool I did not want to exist.  There was no way to force one pool to be chosen over the other, and worse, no way to tell which pool would be chosen by the loader.
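As background on why stale metadata can outlive a destroyed pool: ZFS keeps four copies of its label on every provider, each 256 KiB, two at the front and two at the back.  The sketch below only computes where those copies would sit for a provider of a given size; the helper name label_offsets and the 1 MiB example size are made up for illustration, and it makes no claim about which copies actually survived on this machine.

```sh
#!/bin/sh
# Sketch: where ZFS keeps its four vdev labels on a provider of a given size.
# Each label is 256 KiB; two sit at the front and two at the back.
# label_offsets is a hypothetical helper, not part of zpool(8) or zdb(8).

LABEL_SIZE=262144   # 256 KiB per label

label_offsets() {
    # $1 = provider size in bytes.
    # The usable size is rounded down to a multiple of the label size.
    size=$(( $1 / $LABEL_SIZE * $LABEL_SIZE ))
    echo "L0 0"
    echo "L1 $LABEL_SIZE"
    echo "L2 $(( size - 2 * $LABEL_SIZE ))"
    echo "L3 $(( size - $LABEL_SIZE ))"
}

# Example with a made-up 1 MiB provider.
label_offsets 1048576
```

On a real system, 'zdb -l' followed by a device name prints whatever labels are present on that device, which is one way to check a raw disk and its partitions for leftover pool metadata.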

The only good news at this point was that the machine was not yet in production.

How do you fix this, then?

Peter had a suggestion, since he has run into this before.  Reboot the machine into the netboot environment, and try to force the correct pool into being imported by forcibly removing all device entries for the disks and retrying the ZFS pool import.  This would be done by running:
# rm -f /dev/gptid/* /dev/diskid/* /dev/ada?
# zpool import -o altroot=/tmp/zroot zroot
Unfortunately, the wrong pool was imported again, most likely (but unconfirmed) because of allocating such a large amount of swap on the disks.
# zpool status
     NAME        STATE    READ WRITE CKSUM
     zroot       ONLINE 
       mirror-0  ONLINE 
         ada0    ONLINE 
         ada1    ONLINE
Then I realized the partition table was also corrupt.

After several attempts to coerce the correct pool to import, I became increasingly uncomfortable with leaving the machine in this condition. At this point, there was only one solution - wipe the disks, and start over.

Ultimately, despite disliking the solution, that is what I did to correct the problem, though at the time, I was unaware of the 'labelclear' command to zpool(8), which would have wiped the ZFS pool metadata from the disks.  But at that point, I was not going to take any chances either way.
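With hindsight, the cleanup after the original typo could have looked something like the sketch below: destroy the pool, then clear the ZFS labels from each whole-disk device before repartitioning.  The helper name cleanup_plan is hypothetical, and the script deliberately prints the commands instead of executing them, since 'labelclear' is destructive.

```sh
#!/bin/sh
# Sketch: print the cleanup commands for a pool mistakenly created on whole
# disks. Nothing is executed; cleanup_plan is a hypothetical helper name.

cleanup_plan() {
    pool=$1
    shift
    echo "zpool destroy ${pool}"
    for disk in "$@"; do
        # labelclear removes ZFS label information from the named device;
        # -f treats exported or foreign devices as inactive.
        echo "zpool labelclear -f /dev/${disk}"
    done
}

cleanup_plan zroot ada0 ada1
```

Only after the labels are cleared would the disks be partitioned again and the pool recreated on the 'freebsd-zfs' partitions.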

The takeaway is, despite how innocent a mistake may appear at first, when dealing with metadata stored on disk devices, it surely will come back to haunt you sooner or later.

Wednesday, February 25, 2015

SCALE 13x Trip Report: Michael Dexter

The Foundation recently sponsored Michael Dexter to attend SCALE 13x. Michael provides the following trip report:

SCALE 13x was the 13th Southern California Linux Expo and took place February 19th through 20th in Los Angeles, California. Despite its name, this year's event showed sincere outreach to the BSD community, as demonstrated by two booths and several BSD-related talks. The first booth featured FreeBSD, the FreeBSD Foundation, FreeNAS, PC-BSD and pfSense while the second featured OpenBSD and NetBSD. Both booths were filled with familiar faces including Dru Lavigne, Denise Ebery, Matt Olander, James Nixon, David Maxwell, Brooke and Seth and two toddlers!

The FreeBSD Booth Crew -
Photo courtesy of iXsystems

The variety of booth visitors was very familiar for SCALE: a mix of students, consultants, open source developers and military/aerospace contractors. I heard lots of "I got started on FreeBSD" and "I use FreeNAS" plus the occasional "When can we have a military-certified BSD so we can stop using Linux?" The last one is something I have heard at every SCALE I have attended and is representative of the region. Hats off to the SCALE organizers for also attracting such a diverse audience.

The BSD-related talk topics included David Maxwell's newly-released pipecut that he debuted at MeetBSD (https://code.google.com/p/pipecut/), Brooks Davis' talk on the BERI CPU that he is working on with Robert Watson, Dru Lavigne's talk on new FreeNAS 9.3 features and my talk on FreeBSD Virtualization Options. There were also many overlapping talks such as those on various system containers, embedded systems and of course Brendan Gregg's talk on systems performance. Brendan kindly updated the Netflix statistics that I was already going to address and both Bryan Smith and Randal Schwartz had great user questions. It truly was a pleasure to speak at SCALE and my sincerest thanks to Brendan for live Tweeting my talk.

Impressively, some SCALE speakers were in their teens and the overall outreach to kids was great including an evening kids-only event. The BSD Certification Group scheduled a BSDA exam but alas it was poorly attended. I humbly invite you to take the BSDA exam if you have not done so already and ask that you help spread the word whenever you get a chance.

In a community where we often preach to the converted, I find SCALE to be a very receptive venue for outreach and encourage you to attend and consider submitting a BSD-related talk to SCALE 14x. Special thanks to Gareth Greenaway for reaching out to the BSD community and for the great attitude demonstrated by his team of volunteers. Finally, I would like to thank the FreeBSD Foundation for covering my air travel and O'Reilly Media for allowing me to share a room with one of their amazing team members.

Friday, December 12, 2014

More From Your Newest Board Member: An Interview with Cheryl R. Blain

Recently, The FreeBSD Foundation announced the addition of Cheryl R. Blain to the Board of Directors. We sat down with Cheryl to find out more about her background and what brought her to the Foundation. Take a look at what she has to say:

Tell us a little about yourself, and how you got involved with FreeBSD?
I was bit by the entrepreneur bug in 1999 when working for a non-profit. I’ve worked with high-tech, venture-backed, small-cap companies ever since.  My typical engagement finds me streamlining operations and sales teams to prepare companies for their next step forward, which most often involves financing.

I have a master’s degree in business administration with a dual emphasis in finance and sustainable enterprise, from Saint Mary’s College and as a visiting student at UNC Kenan-Flagler.

Xinuos is the latest high-tech, venture-backed company to which I’ve plied my wares.  While working for Xinuos, I was exposed to FreeBSD for the first time in 2013.  During my first week on the job, I was asked if I was willing to go to Ottawa, Canada to learn more about FreeBSD and the community of developers.  The head of engineering and I felt the conference was very important to Xinuos’ future, so we decided it was an opportunity not to be missed.  Since the trip was so unexpected, I actually had to have my passport over-night shipped to me in our New Jersey office so I could leave the following day!  My colleague and I attended BSDCan and it was everything we had hoped it would be.  We were welcomed by the development community and pleasantly inundated with inquiries about our interest in FreeBSD.  David Chisnall was an especially helpful evangelist of FreeBSD, and made sure my colleague and I had the information we needed.

Why are you passionate about serving on the FreeBSD Foundation Board?
The FreeBSD community (including the board) is in no small part the reason I chose to learn more about the project as a commercial offering two years ago.  My passion is in building businesses, and I wanted to work on a project that was technologically sound, well supported and attractive to people who I like and respect.  The FreeBSD community quickly forgave me for being the least technical person in the room, and was wonderful in embracing the value I can bring to the community from a business perspective.

I look forward to doing my part to ensure that the FreeBSD project has a vibrant future.

What excited you about our work?
There are many things that make FreeBSD interesting...but the first time I think I got really excited was in Ottawa in 2013, when Matt Ahrens gave his talk on ZFS.  Every developer in the room was abuzz with excitement.  In Matt’s presentation he listed logos of the other open source operating systems using ZFS, but I connected with how the room full of BSD developers really embraced Matt as their own.  His bold move to pack his box at Oracle to continue his open source work helped me realize the people associated with FreeBSD are not status quo...they are pushing the envelope. Then I met Peter Grehan and Neel Natu and was introduced to their work on bhyve, and Justin and George as Foundation board members and FreeBSD committers and knew that even though the FreeBSD project has been around since 1993, new excitement and innovation is happening right now.  And I haven’t even mentioned Capsicum or Clang! Oh and I can’t forget, I was there for the naming of Groff with all the rowdy laughter and good spirited banter, and it was then that I felt like I was among friends.

What are you hoping to bring to the organization and the community through your new leadership role?
I hope that my participation in the planning discussions will encourage other business leaders to join in the discussions as well.   

I also hope to encourage those who use FreeBSD commercially to become more vocal about their experiences and use cases, to encourage others to develop with FreeBSD as well.  In doing so, there is a great opportunity to build an endowment among alum to ensure a vibrant future for FreeBSD.

How do you see your background and experience complementing the current board? 
I will be delighted if I am successful in bringing a business lens to the board discussions.  I would like to help elevate FreeBSD in the minds of technology companies worldwide and see a broader acceptance of the OS as a commercially desirable alternative.

Thursday, December 11, 2014

Super Computing Trip Report: Michael Dexter

Michael Dexter has also provided his trip report for Super Computing:

In case you have not heard of the Supercomputing.org conference, it is a meeting of 10,000 researchers, computer scientists, engineers, students, managers, sales engineers and three-letter agency representatives that takes place in a different US city every year. I have hosted a booth at the event since 2009 when it passed through Portland and this year showcased the bhyve Hypervisor and explained all things BSD to brilliant attendees from around the world. I was joined by Patrick Masson, General Manager of the Open Source Initiative, who helped shed light on the pervasive yet unrecognized use of open source software by the universities, organizations and companies at the event. Literally 90% or more of the exhibitors rely on open source but few give it any recognition. For years, GNU/Linux has dominated the Top500 list of supercomputers that is announced at the event each year and I set out to help change that by highlighting bhyve, OpenZFS and other great technologies in FreeBSD.

SC14 could not have started on a better note thanks to the announcement on the first day that the FreeBSD Foundation received a million dollar donation from WhatsApp founder Jan Koum. I heard many people say "I used FreeBSD ten years ago" and the news instantly got their attention and set the tone for the rest of the event. By showcasing ZFS, we drew the attention of ex-Sun Microsystems engineers and executives and even had a visit by UC Berkeley CSRG research assistant Clem Cole. The message that "BSD is back" was loud and clear and I canvassed the Student Cluster Competition to help inspire a new generation of users who had never heard of the BSDs.

The bhyve booth was in the heart of the ARM pavilion which made for some enlightening conversations. bhyve and the ARM CPU architecture both stand out for operating without emulation, resulting in simplicity and performance for bhyve and significant power savings for ARM. A roadmap exists for bhyve support on ARM and hopefully this will be something to showcase at SC15. Of the exhibiting ARM partners, the SoftIron team stood out as loud and proud users of FreeBSD and I look forward to seeing them at future BSD events.

FreeBSD vendor iXsystems was also at the event demonstrating FreeNAS and TrueNAS, as were the SaltStack team who received a bhyve demo and expressed a sincere desire to include support for bhyve. A handful of other open source vendors like Red Hat were in attendance plus FreeBSD consumers like Spectra Logic, EMC/Isilon, NetApp and Juniper. Many individual open source users came to the booth and my favorite quotation came from a conversation at a Mellanox event: "Our administrators use FreeNAS at home and come work and ask 'why the heck aren't we using ZFS?'" Open source is winning but there is still much work to be done.

Speaking of work, I asked many people, including Navy researchers moving massive uncompressed video streams, what FreeBSD needs to do to get back on the Top500 list of supercomputers. The short list of answers I received was: OFED/OpenFabrics Enterprise Distribution support, OpenMPI/Message Passing Interface support and Lustre distributed file system support. Surprisingly, NUMA/Non-Uniform Memory Access did not come up. Interconnect vendor Chelsio Communications stood out as a solid supporter of FreeBSD and dominant player Mellanox expressed interest in expanding their support for FreeBSD given the opportunity it represents. All in all, people were very receptive to giving FreeBSD and other BSDs a try, especially given that it would be a homecoming for so many users.

I wish to thank the FreeBSD Foundation for sponsoring the bhyve booth at SC14 and I am delighted to hear that ARM has just made a generous $50,000 donation to the Foundation. In total I gave out 250 tri-fold brochures and talked to hundreds of people at SC14. Hopefully those seeds will take root and we will start seeing FreeBSD systems in the Student Cluster Competition and on the 2015 Top500 supercomputer list!

Wednesday, December 10, 2014

FreeBSD Foundation Welcomes New Board Member - UPDATED!

The FreeBSD Foundation is pleased to welcome Cheryl R. Blain to the Board of Directors. 

Cheryl became involved with the FreeBSD community in 2013.  She joins the Foundation's board with extensive experience managing software development and building strategic alliances for privately-held, small-cap companies. Cheryl's background includes community outreach, marketing and fundraising efforts with non-profit organizations. We are thrilled to have her as part of the team.

One of the responsibilities of our board is to focus on the big picture, by defining our vision, mission, strategic direction, project planning, as well as governing our organization. Our board has decades of experience working on FreeBSD in design, development, documentation, research, education, and advocacy. We've been strong in providing support in the project development area. As we've grown, we've identified the need to expand our board, and we've identified skills, talents, and experience we want in new board members.

Cheryl fills the need for bringing on someone who has a strong business development background. She will help provide a clear direction, strategic planning, and guidance for us to support FreeBSD in the future. In order for us to continue our growth, we need a more stable and consistent funding pool. Cheryl's extensive fundraising background and business connections will help us build and strengthen our business relationships to encourage multi-year donations.  She brings with her a passion for FreeBSD and a desire to use her talents to advance the mission of both the Project and the Foundation. Hear more from Cheryl here.

Please join us in welcoming her to the board.

MeetBSD Trip Report: Michael Dexter

The Foundation recently sponsored Michael Dexter to attend MeetBSD, which was held in California in November. Michael provides the following trip report:

This year's MeetBSD California marked a departure from its UnConference roots in favor of a showcase of exciting new developments in the community. Western Digital kindly hosted the event which made for a pleasant, professional atmosphere and attendees traveled from as far as Japan and Eastern Europe to attend.

Of the many talks, the Sony confirmation that it is a long-time BSD user was simply historic and just may be the result of years of encouragement by AsiaBSDCon attendees. It's not every day that you confirm the existence of millions more BSD users! Yes, "BSD" users at the request of the Sony legal department. On the same theme, "600M+ Unsuspecting FreeBSD Users" by Rick Reed of WhatsApp also shed light on the heavy lifting companies are doing with FreeBSD and finally, Scott Long and Brendan Gregg of Netflix reminded us how they are pushing 1/3rd of US Internet traffic each evening. Brendan spoke about performance analysis strategies at both MeetBSD and the Developer Summit that followed and I dare say is downright giddy about the performance analysis options available on FreeBSD. In his second talk he incorporated audience feedback on the spot and I for one am delighted to see Sun Microsystems refugees like Brendan come to the BSD community as they each bring a wealth of experience.

Kirk McKusick's “A Narrative History of BSD” was a delight as always and reminded us that there is absolutely nothing like BSD: professional and open source from the start with a mission to bring sanity to government computing. That mission sounds more like a contemporary meme than a 1970s and '80s government-funded initiative! Kirk told us about Bill Joy's prolific coding and how they navigated the pressure to incorporate the BB&N network stack into BSD. Kirk also told us the story of how a delay in grant funding accidentally got him into a lifetime of fast file system development and how we almost had 48-bit IP addressing. Hearing both Kirk and Brendan Gregg talk about the frivolity of most benchmarks decades apart was eye opening!

Finally, David Maxwell's "Pipecut" talk was a mind-blowing introduction to a pet project of his that promises to change how we all use the Unix command line. Most of these talks are online and can be found via meetbsd.com/agenda/.

As with any BSD event, the hallway track was worth the price of admission and I had the pleasure of meeting bhyve and FreeNAS developers that I had only met online. Adrian Chadd tinkered with a Surface Pro system and eventually got the keyboard working late one night and naturally had the only working WiFi in the hotel lobby. Glen Barber and I continued our "the good, the bad and the ugly" talk about distribution mirror layouts based on his work as FreeBSD release engineer and my work supporting various OSs on bhyve. Devin Teske provided scripting advice as always and I cornered people about topics ranging from the status of virtual networking to a ZFS panic.

Every BSD event has its own character and MeetBSD is no different. The fact that it takes place in Silicon Valley allows it to have a great mix of speakers and attendees who might not make it to international events. Thank you iXsystems for putting on yet another great MeetBSD!

Monday, December 8, 2014

The FreeBSD Cluster: Infrastructural Enhancements at NYI


I spent several days on-site at our east-coast US colocation facility in July 2014 and again in November 2014 racking and installing servers that the FreeBSD Foundation purchased for the FreeBSD Project.

This hardware is essential for supporting the FreeBSD Project in a number of ways.  It provides services for public consumption (FTP mirrors, pkg(8) mirrors, etc.), as well as resources that can be used by FreeBSD developers for various tasks, such as building third-party software packages, release building, and miscellaneous (a.k.a. "testbed") development of services for general use.

More Horsepower to Serve and Support the FreeBSD Community

Since July, fourteen machines were purchased for the east-coast US site, generously hosted by New York Internet in Bridgewater, New Jersey.

The servers were purchased with the end-goal being a complete mirror of the primary site on the west-coast US.  The newly-added servers bring the machine count at NYI to sixty-eight total.

Reorganizing for Redundancy

Two of these servers are being used as firewalls, each equipped with four-port Intel(r) NICs.  Both firewalls have direct connections to the switches in all four cabinets at NYI, providing a redundant uplink to each of the four switches so we can reboot either firewall without losing connectivity.

Restructuring for Additional Services

November's site visit had two primary goals: install and configure the recent shipment of machines, and reconfigure the network topology behind the firewalls.  Before many of the machines could be brought online, several changes needed to be made to the network.

Each FreeBSD.org site further separates services behind the firewalls using VLANs, limiting each set of services provided within each VLAN to its own network restrictions.  In order to properly allocate network space for the new machines, several of the VLANs at NYI needed to be redone.

The most publicly-disruptive part of this was reallocating the VLAN that contains the firewalls.  Thanks to Peter Wemm, there were no major service disruptions (aside from a planned simultaneous firewall reboot).

Although not all of these machines have been brought online yet, several of them have been allocated and assigned to the teams that will be using them.

Two machines have been allocated to the FreeBSD Release Engineering Team, one of which was used for the 10.1-RELEASE builds.  Four machines have been allocated to the FreeBSD Ports Management Team, which were brought online and handed over just this week.

FreeBSD, Powered by FreeBSD

If you are like me, words about new hardware do not do as much justice as seeing it.  Enjoy!


new servers - front view
new servers - rear view