Why writing a Windows compatible file server is (still) hard

I don't often write about my day-to-day work, but sometimes I run across a problem that is so intransigent that it was a triumph when I finally fixed it. If you take an engineering job in the software industry, this is the kind of thing you might end up working on. If you find this column fun and interesting, then you might be a good candidate for a network engineer. Even if you don't, I hope you'll appreciate the insane level of detail network engineers have to know on your behalf, to make something as simple as “saving a file” work seamlessly across operating systems.

One of the remedies imposed on Microsoft after they lost the European Union workgroup-server anti-trust case was the requirement to publish the full specifications for third party software to interoperate with their operating systems. They are still in the process of doing this, but there are now thousands of pages of documentation out there, in theory fully specifying the Server Message Block/Common Internet File System (SMB/CIFS) protocol that Samba and Windows file servers implement. So surely anyone and their auntie (assuming your auntie is a network engineer :-) can now write their own SMB/CIFS server by just reading this copious documentation. After all, now that it's all documented, how hard can it be?

A bug I fixed this week illustrates why I still think Samba is the leading choice for interoperability between Windows and Linux/UNIX systems. It concerns a strange tale of Microsoft Office and the “Offline Files” remote synchronization feature. “Offline Files” in Microsoft Windows allows a user to save a version of a file they're working on from a remote file system on their local laptop, and have it re-synchronized to a server when they get back online.

A user of Samba reported a bug that showed conclusively that trying to synchronize a Microsoft Office file against a Samba server wasn't working. The Windows client "Sync Center" application kept telling the user that the file on the remote Samba disk had been changed since it was saved, and he knew this wasn't the case.

It got stranger. It only happened with Vista, not with XP or Windows 2003. It only happened with Microsoft Office 2003 (all other versions of Office worked fine). It only reliably happened with Microsoft Excel, no other Microsoft Office application. Have I mentioned how much I hate Microsoft Excel? I quake in fear whenever I see an Excel interoperability bug logged against Samba. That application is perverse in the things it will do to a remote file server.

I looked at my nice new shiny downloaded Microsoft documentation. There was nothing related to this problem in there. The document describing the precise behavior of an NTFS filesystem as seen over the wire from an SMB/CIFS server is yet to be finished. They're still working on it. O.K. so let me check what happens when you use this version of Excel to do the very same thing against Windows. Maybe it's a real bug that fails against a Microsoft file server too? Stranger things have been known. No, it worked fine against a Windows 2003 server, which to be honest did not surprise me. Microsoft tests the hell out of Microsoft Office before shipping any software that interacts with it in any way.

Right, time to get out the big guns. A debug log from Samba at our highest logging level, and a network packet capture trace (using the Open Source software “wireshark”) of when the problem was happening. Looking at the log didn't show any obvious errors, other than the fact that Excel does an insane number of operations over the network to do something as simple as a "Save File" (if you've ever wondered why Excel is slow, look at what it does over a network). A brief glance at the network capture trace didn't help either: everything looked fine, except that on the save operation to the Samba server Excel strangely decided to abort halfway through.

This was getting more interesting. It seemed to be a generic failure of the "Save" operation, nothing to do with the "Sync" feature at all. So let's test saving an Excel file against a Samba share without the "Sync" feature turned on in the client. Surely this must work, we also never ship a version of Samba without testing against Microsoft Office. Yes indeed, a normal save worked fine. So it was something to do with the "Sync" feature. But what could it be?

The only thing to do was to do a second wireshark trace from the client to a Windows 2003 server, and then compare the two packet traces, the "bad" against the "good", packet by packet.

Except of course it's not that easy (nothing in Windows interoperability ever is :-). Due to the differences in response times between servers, slight differences in supported features, and of course the fact that the Samba architecture is completely different from that of the Windows CIFS server, the packet streams soon become very different. But after you've been doing this work for seventeen years, you start to recognize the fingerprints of the broad actions that clients are trying to do, even with a protocol as chatty on the network as SMB/CIFS.
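To give a flavour of what "comparing by fingerprint" means, here is a minimal sketch in Python. It is not the tooling I used (that was Wireshark and eyeballs). It assumes each trace has already been boiled down by hand to a list of coarse operation names, and every name in the example is made up for illustration. The standard `difflib` module then reports the first place the two sequences disagree.

```python
from difflib import SequenceMatcher


def first_divergence(good_ops, bad_ops):
    """Return (tag, good_slice, bad_slice) for the first difference, or None.

    good_ops / bad_ops are lists of hypothetical operation names that a
    human has distilled from the "good" and "bad" packet traces.
    """
    matcher = SequenceMatcher(a=good_ops, b=bad_ops, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            return tag, good_ops[i1:i2], bad_ops[j1:j2]
    return None


# Illustrative, invented operation sequences (not taken from a real capture).
good_trace = ["create_temp", "write", "set_create_time",
              "change_notify", "rename", "close"]
bad_trace = ["create_temp", "write", "set_create_time",
             "rename", "close"]

if __name__ == "__main__":
    print(first_divergence(good_trace, bad_trace))
```

The hard part in real life is the distillation step, not the diff: timing and feature differences between servers mean the raw packet lists never line up this neatly.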

It took a couple of weeks of staring at the packet traces, on and off, but I eventually narrowed it down to a difference once Excel had written a temporary file out to the remote disk. Things started to be very different (and obviously wrong) at that exact point. So I started to look at the packets very closely.

The client was trying to set a "created" time stamp, to make the temporary file pretend to have been created at exactly the same time as the original file. Now one of the interesting things in writing Samba is that it has to run on top of POSIX. A POSIX system is very different from Windows, so one of the challenges we have is to be able to emulate the different Windows features on top of standard POSIX.

A POSIX file system doesn't have a "create" time stamp, so when we're reporting back to Windows when a file was created, we have to look at all the available time stamps from the system, and just pick the earliest. This has always worked in the past, but maybe we'd finally run into a situation where we need that exact create time stamp as set by the client.
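The "pick the earliest" idea can be sketched in a few lines of Python. This is an illustration only, not Samba's code (Samba is written in C), and the function names `emulated_create_time` and `demo` are hypothetical. It assumes the three classic `stat()` time stamps are all that is available.

```python
import os
import tempfile


def emulated_create_time(path):
    """Approximate a Windows "created" time on a POSIX file system.

    Assumption: the earliest of atime, mtime and ctime is the best
    available stand-in, since plain POSIX stat() has no creation time.
    """
    st = os.stat(path)
    return min(st.st_atime, st.st_mtime, st.st_ctime)


def demo():
    """Create a scratch file, back-date its mtime, and report the result."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
    try:
        # os.utime takes (atime, mtime) in seconds since the epoch.
        os.utime(path, (1_000_000_000, 900_000_000))
        return emulated_create_time(path)
    finally:
        os.remove(path)
```

The weakness is visible straight away: anything that pushes a time stamp backwards (as `demo` does with the modification time) silently changes the "created" time the client sees.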

So I spent part of a day adding a temporary "created" time stamp into Samba, only held in memory. If this worked and fixed the bug I'd then find somewhere to store this on disk (probably in an “extended attribute”).
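Roughly, the experiment looked like the following Python sketch. Again this is a hedged illustration rather than the real patch: the class name `CreateTimeStore`, its methods, and keying by file name are all assumptions made for the example, and nothing is persisted to disk.

```python
import os


def earliest_stat_time(name):
    """Fallback: the earliest of the POSIX time stamps for this file."""
    st = os.stat(name)
    return min(st.st_atime, st.st_mtime, st.st_ctime)


class CreateTimeStore:
    """In-memory "created" time stamps, lost when the process exits."""

    def __init__(self, fallback=earliest_stat_time):
        self._overrides = {}       # file name -> create time set by a client
        self._fallback = fallback  # used when no client has set a value

    def set(self, name, create_time):
        # Remember exactly what the client asked for.
        self._overrides[name] = create_time

    def get(self, name):
        # Prefer the client's value; otherwise guess from the file system.
        if name in self._overrides:
            return self._overrides[name]
        return self._fallback(name)
```

Making this permanent would mean writing the value somewhere that survives a restart, which is where an extended attribute would come in.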

No, this still didn't fix it. This was starting to make me very angry as it made no sense. I stared at the packet traces again. Even more closely. Then something jumped out at me.

The SMB/CIFS protocol has a feature where a client can be notified when a change is made on a remote file or directory. It's called a “change notify”. Normally it's used to allow a client to discover when another client is modifying the same file system (it's the reason Windows “Explorer” windows spontaneously refresh with new files if a work colleague modifies the directory you're looking at). But even if a client modifies the file itself, the server still must send “change notify” packets to let the client know a file it has just requested to be modified has actually been modified. At the point in the packet stream just after the create time stamp change was requested the Windows server was sending a “change notify” packet, but the Samba server was sending the “change notify” after the file was written to instead. It was exactly the same packet, surely that couldn't be the problem?

I looked at our code. As POSIX can't store a created time stamp, if the client requests it to be changed (and no other time stamps) we simply return a success code. But we weren't sending a “change notify” back after this request, as technically we weren't changing the time at this point. Instead we were sending it back after the file write, when we were changing the file. So I added code to send the “change notify” back after the time stamp change.
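The before and after behaviour can be modelled with a toy server in Python. This is a simplified sketch, not Samba code: the class `ToyServer`, the flag `notify_on_create_time_set`, and the event strings are invented, and the order of the notify relative to the success reply is an assumption.

```python
class ToyServer:
    """Records, in order, what a client would see on the wire."""

    def __init__(self, notify_on_create_time_set):
        self.notify_on_create_time_set = notify_on_create_time_set
        self.wire = []

    def set_create_time(self, name):
        # POSIX can't store the value, so nothing changes on disk;
        # the request is simply acknowledged as successful.
        self.wire.append("set_create_time OK")
        if self.notify_on_create_time_set:
            # The fix: tell the client now, as a Windows server does.
            self.wire.append("change_notify")

    def write(self, name):
        self.wire.append("write OK")
        self.wire.append("change_notify")


def save_sequence(server, name="tmp.xls"):
    """Replay the two requests that mattered and return the wire log."""
    server.set_create_time(name)
    server.write(name)
    return server.wire


old_wire = save_sequence(ToyServer(notify_on_create_time_set=False))
new_wire = save_sequence(ToyServer(notify_on_create_time_set=True))
```

In the old behaviour the first "change_notify" only turns up after the write; with the flag set it follows the time stamp request directly, which is the ordering the Windows server showed in the trace.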

And the bug disappeared!

I went into one of my colleagues' offices and kicked the hell out of one of the much loved Google beanbags, all the while screaming obscenities into the air for a good five minutes. He looked on with bemused amusement. I finally calmed down enough to explain the problem. One packet being returned at the wrong time. One single mis-timed packet caused a ripple effect in the Windows client file system software that was seen all the way up in the complex user interface of only that particular version of Excel, when interacting with the “Offline Files” feature, only on Windows Vista.

The remaining task was to add a regression test into our test suite, so that this specific bug is tested for before we release any new versions of Samba. The code isn't done until it's properly tested. But at least the user is now happy.

Interoperability with Windows is hard. But somebody has to do it. And if you're going to do something, you might as well try and do it well (and try and have some fun at the same time) :-) .

Stop the press. As I go to publish this the user still occasionally reports the failure even with the patch, just not as often. Looks like there may be a secondary timing effect in play as well. Oh well, no one can say this job is dull.

Jeremy Allison
Samba Team.
San Jose, California.
30th July 2009.



Comments

Serenity now....Serenity now....

Hoochie Mamamama..........

The subject at hand

Well, gee. If all you're gonna talk about is Microsoft, maybe you should call this "Chumps D'Lux".

Just kidding. I am an admirer of your work and appreciate the insights you provide through your blog.

Brutal...

...but fascinating. The real frustration is that the problem has not been completely solved though.

If only you had better documentation eh?

I once worked on a realtime

I once worked on a realtime project. We too "had a bug". After much testing I found that the interface worked for 21 of every 23 seconds. The designer's comment: "I didn't design it to work that way."

You do great work. Thank you, for samba, and this blog...

You remind me of ...

... the stork (from the latest Pixar's short movie Partly Cloudy) that delivers the baby ram, alligator, porcupine and shark. Tough work, but someone's got to do it.

Good job, and well done! :)

Cappella

No regressions?

Jeremy,

Introducing this timing fix introduced no regressions? Amazing!

Cheers,

Colin

None that our test suites can find.

And they are pretty thorough. Microsoft use them to test for regressions between different versions of Windows.

Jeremy.

Follow up

I'd really love to see a follow up to this, because the update at the end has piqued my curiosity. I'm not a network engineer, or even a programmer, but I do play Magic: the Gathering (this is a strange example, but bear with me), and I'm a real stickler for the rules - exact wordings of the cards, exact steps in each of the phases. It irritates me no end when people start doing things out of sequence, or skipping steps, because doing things like that has the potential to really screw up the game (and don't even get me started on the new rules!). I believe networking protocols must be very much the same, so maybe I'd make a good engineer :P.

Revelation:

So, what's the big picture here?

Microsoft's abusive monopoly does not (really) obey the courts and acts as if it is above the law (because they can and do, buy it).

Also, the biggest picture is, Microsoft software is not in the best interest of you, the computing public. It certainly is proprietary and not conducive to interoperability.

Conclusion: You, (yes you) would be better off using open software and supporting its faster and most progressive improvement and enjoying it. For POTENTIALLY greater user friendliness, including much easier interoperability.

One of the last advantages of "Windows" is its number of users, which is dropping. BUT, we can only further facilitate open software adoption (migration and usage) by building a VERY familiar system, without ANY measurable losses AND with OBVIOUS and promoted, EXTRA benefits. We are almost there but we are not there yet! Therefore we need to slow down and major on the majors. We simply need the current XP users. It's really that simple. That's the only way that some of the Microsoft aligned hardware device makers will completely help. It's the ONLY way to fight the monopoly abuse. Abuse that is only obvious to the technical among us.

You are either a developer who cares (about coding for the 75% of the population; that are not technical) or you are a developer who could care less. We need to step up our game. Now is the time to gather the current XP users. It's not too late; but soon, things could be slowed. If you want to see success, we need a majority. That, is the goal. The competition, is the status quo.

1. Familiarity. (With No losses.) No holds barred.

2. Plus GREAT, easily promoted; exclusive benefits.

Do you have ideas on

Do you have ideas on (technically) why the failure with Samba happened only from Vista running that Excel version? Does that early packet sent happen in other Windows/Excel combinations that you have tried? [I know the problem hasn't been resolved "completely".]

When Microsoft products fail their own documented behavior (was that the case here?) or write docs with gaps or other problems (and hence foil/frustrate interoperability), do you know if there are any EU enforcement actions taken (like fines)?

Bugs will always exist, but that should not be used as an excuse to preserve Microsoft's significant market share.

Mon Dieu!

Good luck finding the second part of the bug. It'll make for interesting reading

Debugging Assistance ?

Great bit of debugging Jeremy. If you would like any assistance in the debugging of the Windows protocols and verifying the Windows protocol documents for this scenario just let me know or drop an e-mail to dochelp@microsoft.com. Nick and I are more than willing to assist. We’ll also look at this scenario in relation to the various existing and under-development protocol documents to see whether any improvements can be made.

Hongwei Sun
Protocol Team
Microsoft

Thanks for the offer of help.

Thanks Hongwei, I'm off home sick at the moment, but I have one more theory to test before I pass it over to you to look at :-). Look for a bug report tomorrow (or at the latest Thursday).
Thanks !
Jeremy.

Same bug, different mechanism?

I've noticed that local (non-ssh) rsync fails to mounted samba shares as it cannot properly determine if the remote file has changed. Files get copied over needlessly. I suspect this might be a more reproducible failure mode for you as it is very likely a similar issue regarding working with file metadata to detect changes. I've seen it on most distros where the command is run on a linux box to copy files to and from a windows one.

More General?

This is a good story, but do you think maybe it's a bit unfair to single out CIFS/SMB for this? I think your story generalizes quite well to "Why writing (real software that interacts with real legacy code) is (still) hard"

I mean, say you were trying to implement something as simple as an FTP client that actually worked against all the various FTPd implementations in the world, I bet it would be insanely difficult, and FTP is both an open standard and incredibly feature-poor compared to SMB/CIFS.

MJ

The documentation MS released is garbage. It is not nearly at the level needed for implementation. Some of it is flat-out wrong or misleading.

just my naive thoughts on documentation

I imagine that from an MS perspective (for this particular issue), something like "the code is the documentation" might be their explanation. In other words, it's got to be pretty difficult (if not impossible) to document something that has gone through years/decades of change and still stay true to the specification. How many human-hours of work have gone into the CIFS protocol?

I'm not saying it's right or wrong, just that if you were to document every aspect and nuance to explain behavior of a piece of software, then you're effectively just (re-)writing the software aren't you? In which case you're basically screwed (cynical viewpoint I know). Not that I don't appreciate the work done by the SAMBA team - just that it's got to be painful.

Reminds me of http://www.joelonsoftware.com/items/2008/02/19.html

Best wishes

It is hard to emulate complex systems

I understand how amazingly difficult finding such bugs can be and that's probably why they call programming the art of debugging! However I don't think it is fair to blame MS for everything. Debugging complex software is difficult. Debugging software that emulates another complex piece of software is extremely difficult. The fact that POSIX filesystems do not have an equivalent for every single feature of another filesystem will make life difficult, and I suspect that even if MS had documented absolutely everything completely and correctly, we would still have other hard to find/understand interactions that would be hard to debug. Anyway, it's great you could find and fix the bug!

-AlefSin

Bug in Samba

This is simply a bug in Samba which, like many bugs in Samba, is born from a lack of rigor and/or an inability to work past POSIX.

When a timestamp of a file is modified, the change request must be sent immediately afterwards, before the next acknowledgement of the next request in the stream. To do anything else is silly. The file has been written to; the fact that Samba doesn't actually write to the file is a bug in Samba, plain and simple. In fact, before change notify support was even designed, this behavior should have been characterized by exhaustive automated testing against Windows.

Samba is rife with crap like this, probably because it was written by systems administrators and not professional programmers. I'm not surprised that Sun and Apple are moving away from Samba to their own, scratch built implementations, and other vendors are moving to Likewise.

Let's look at the alternatives.

Firstly, yes it was a premature optimization bug which is now fixed. All software has these things, get over it. The complexity of the change notify interaction with all other parts of the system means that it's not possible to test every possible path through the code. We have a good set of changenotify tests, which of course is being extended to catch this very issue. I can say this because Microsoft also use our tests in the same way that we do, to catch regressions. Is their stuff "rife with crap like this"? (Some would say yes of course :-).

Apple are moving away from any GPL code in their distribution. It matters not how good/bad Samba is to them, so long as it's GPL they're in the process of getting rid of it. They're even moving to a compiler they own. I believe the GPLv3 democratic process scared the life out of them :-).

Sun bought the Procom CIFS stack and are busy adding it to their Solaris kernel. I still doubt the wisdom of adding code the complexity of CIFS into a stable kernel (I doubt the Linux developers would have any of it) and what it will do to their stability. It's still feature poor (no krb5 integration last time I checked), and specific to one platform. Typical Sun, doesn't play well with others. Let's see what Oracle decides to do with it. Some of their developers are very good though, and I'd welcome them working with us on Samba once Oracle brings the axe down on unnecessary projects.

Likewise code, the last time I looked at it, was using the mtime as the create time. This just isn't going to work long term but it's still very much a work in progress - 17 years behind Samba, so let's see if they (or others) can fix it.

Let's look at another competitor, the NetApp CIFS stack. If you look at their performance on the wire you see quite an interesting story. Many of the trans2 calls return with "Unknown level" responses, their change notify story isn't finished and there are several other gaps in their implementation. We aim for completeness, they aim for "what the Windows client market will bear" as a minimal implementation that will sell. They put resources into doing SMB2 ahead of us, but if you don't have the basic protocol correctness then it will eventually bite you hard.

Your last comment I found very amusing, not the least because the only developer we had working on Samba who started as a professional system administrator now works on Likewise full time :-). Still he's a very talented developer, and I miss him.

Jeremy.

Network Engineering?

I realize this will come across as a stupid gripe to a lot of people but here goes anyway:

As a network engineer (Cisco, Juniper, Extreme, Foundry et al.), I was expecting an article on network engineering and how some sort of routing anomaly was causing problems with an application.

What you've described is a complex (and impressively researched) _software_ engineering bug. Just because the software operates over the network, does not make this a network engineering problem. Network engineering refers to the layout, implementation and management of networking equipment- Routers, switches, etc.

Microsoft seems to have started this confusion- you never hear a Unix admin call themselves a network engineer despite the fact that Unix systems were networked long before Windows even existed.

Vista networking

It's not just SAMBA that Vista doesn't play very well with, the entire networking implementation is slow and riddled with annoying bugs which Microsoft don't seem capable of fixing.

Somehow I think that Windows 7 will be exactly the same!

Thank you!

Thank you and to the whole Samba team. You're doing a great job.

Where do I donate (without PayPal)?

Why are you using Windows anyway?

You shouldn't waste time running a server with Windows anyway, it just doesn't cut it compared to real server OSes.

Windows Server makes things slow, inefficient and very, *very* insecure. IIS itself is enough to scare most serious admins away.

Thanks, Jeremy....

I'm so happy to read your post. I'm not a coder, so I've been waiting almost 3 years to know what was happening with Samba and Excel. 3 years ago I deployed a file server using openSUSE 10 + Samba, and my client always complained because MS Excel (2002 & 2003) suddenly closed without any error warning when opening and working with an Excel file from the server.... In the end I had to make the hard decision to move to Windows 2003 server to solve the problem (it was solved with w2k3). I hope the Samba Team will release a new Samba server soon. I will send my support to the Samba team.... Great work.

Best regards
Darma Yasa, Putu Teguh - Bali Island Indonesia
