As I mentioned in the last report, I had been writing tests to find the problems in SHFileOperation. All I knew was that in some situations it would end without fully writing some file, so I wrote tests moving, copying and deleting a folder structure, and comparing the copy with the original. I tried with empty files first, then with files of random sizes, then with really big files, and also with a lot of files of relatively small size. In all my tests, I never once found a file that didn’t appear to be correct. The folder names, folder contents, file sizes, file attributes, … everything matched.
Being unable to reproduce the issue, I started to wonder if it wouldn’t be best to follow Huw’s idea, which would have consisted on implementing the NT6 method for file operations. In Vista and up, there is an IFileOperation interface, which can be used in place of the SHFileOperation function, and support a much more flexible means of listing operations and assigning a dialog used to track them. Then, the SHFileOperation call could be rewritten as a wrapper to IFileOperation. This would provide the benefit of being ready for a future where ReactOS implements Vista+ user mode, while at the same time allowing a more structured (and, by extension, robust) implementation of the file operations.
So after confirming that Windows 2003 didn’t have a similar internal interface (it does not appear to use any objects at all in the implementation, as far as I could tell), I began looking into this interface and how my installation of Windows 7 uses it when calling SHFileOperation. This is when I realized that the CFileOperation object, which is the name of the class that implemented IFileOperation in Vista+, has some internal methods specifically designed to be used by SHFileOperation, instead of using the public interface. This kept me wondering which of the three options I should follow. The first option would have been to fix the existing SHFileOperation code, potentially running into problems in the future when trying to add progress dialogs to it. The second and third options would be to rewrite the file operations using the IFileOperation interface, but in the second, I’d make SHFileOperation a wrapper to IFileOperation, while in the third I’d follow Microsoft’s choice and add some internal methods to make the implementation simpler.
While I was weighting the options but before I had a chance to decide, Giannis messaged me and said I should drop this task for now, and focus on the bigger stability and reliability issues: leaks and crashes.
In this topic, we have two major issues unfixed: the menu leaks when navigating in the file browser windows, and the COM object leaks from shell menus such as the start menu. The former isn’t too big, but it does mean that after a few hundred navigations, the desktop heap will fill up and the win32k’s user32 module will not be able to allocate any more objects. The latter is worse.
To understand this issue, you need to know that all COM objects have a reference counter. This counter is incremented when calling AddRef, and decremented when calling Release. Some methods such as QueryInterface or GetSite, will implicitly call AddRef for you, but they leave you the task of calling Release when you no longer need the object.
The shell menus are composed of their own hierarchy of objects. The outermost object is the CMenuDeskBar, which is the object that implements the outer window frame, and connects the menu with the parent (start button, menubar, or another menu band). Inside the deskbar, there is a CMenuSite, which interfaces between the deskbar and the content of the menu. Inside the menu site, there is a CMenuBand, which is a menu object used in both popup menus, and the menubar in explorer windows. Because the deskbar needs to know its child site, it has a reference to it, and such causes the site’s refcount to be incremented by 1. Similarly, the site needs to know the menu band, so it increments the menuband’s refcount by 1, to mark it as being used. But here comes the interesting part. In order to communicate with the parent deskbar, the menu band has to know its parent site, which in turn has to know its parent deskbar. This means that this group of objects has a double circular reference (deskbar<->site<->band). This is not such a bad thing on its own, and it happens often in COM object structures. But the important thing to notice is that unless there is a code path that tells the deskbar, the site, and/or the band to release their references, the loop is never going to break, and the reference counts are never going to reach 0, meaning the objects will never be destroyed. And yes, this implies all the system resources used by those objects (windows, toolbars, fonts and icons) are never being released.
As you can imagine, the lack of such a code path is a really bad problem, since the handle counters grow by nearly a dozen every time a shell menu pops up. The problem is, we don’t know what codepath is the one supposed to break the circle, and there isn’t any obvious function called “Destroy” or similar, so finding the right functions to call and the right place to call them will require some very long and boring debugging sessions. That’s why I always avoided the dreadful task, until now.
I have been
delay planning how to approach the issue, and starting on Monday I will have to get to it, one way or another, and fix both the menu leaks and the COM circular references.