What this course is NOT about: how to use Windows (or Mac OS); learning a particular operating system. What this course is about: What are operating systems? What do operating systems do? How do operating systems do what they do? What are the open issues?
The main topics this course deals with: process management, memory management, file systems, I/O resource management, user management, and security and reliability issues.
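1. Mars Pathfinder Priority Inversion (process management) Incident: On July 4th, 1997, the Mars Pathfinder landed to media fanfare and began to transmit data back to Earth. Days later, the flow of information and images was interrupted by a series of total system resets. The resets were traced to priority inversion: a low-priority task held a shared resource that a high-priority task needed, while medium-priority tasks kept running and prevented the low-priority task from releasing it.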
2. Race Condition Bug Creates Blackout for 50 Million (process management, security) On August 14, 2003, a blackout across eight US states and Canada affected 50 million people. PC Authority described the cause, a race condition bug, as something that occurs when “two separate threads of a single operation use the same element of code.” Without proper synchronization, the threads interfere with each other and can crash a system. That is what happened here, with the result that 256 power plants went offline. Cellular communication was among the services most disrupted, and the best form of communication during the outage was said to be a laptop using a dial-up modem. And if you just cringed in horror at the word “dial-up,” you’re not alone.
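The lack of synchronization described above can be sketched in a few lines. This is only an illustration, not the code involved in the blackout: the class AlarmCounter and the function run_workers are hypothetical names invented for this example. Several threads update one shared counter, once with an unprotected read-modify-write and once with a lock around the update.

```python
import threading


class AlarmCounter:
    """Hypothetical shared state updated by several threads."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        # Read-modify-write without synchronization: another thread may
        # run between the read and the write, so an update can be lost.
        current = self.value
        self.value = current + 1

    def increment_safe(self):
        # The lock makes the read-modify-write one indivisible step.
        with self._lock:
            self.value += 1


def run_workers(method_name, workers=4, iterations=10_000):
    """Run `workers` threads, each calling the named method `iterations` times."""
    counter = AlarmCounter()

    def work():
        step = getattr(counter, method_name)
        for _ in range(iterations):
            step()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print("with lock:   ", run_workers("increment_safe"))
    print("without lock:", run_workers("increment_unsafe"))
```

With the lock, the final count always equals workers times iterations. Without it, the count may come out lower on some runs and correct on others, which is what makes race conditions hard to reproduce and test for.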
3. Patriot Missile System Timing Issue Leads To 28 Dead (Real-Time OS issue) The most tragic computer software blunder on our list occurred on February 25, 1991, during the Gulf War. While the Patriot Missile System was largely successful throughout the conflict, it failed to track and intercept a Scud missile that went on to strike an American barracks. The software had a delay and was not tracking the missile launch in real time, thus giving Iraq’s missile the opportunity to break through and explode before anything could be done to stop it, according to the US Government Accountability Office. In all, 28 people were killed and more than 100 were injured.
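The slide does not say where the delay came from. Widely cited public accounts of the GAO findings attribute it to accumulated clock error: time was counted in tenths of a second and converted to seconds using a value of 1/10 cut off to fit a 24-bit fixed-point register, and the battery had been running for about 100 hours. The sketch below reproduces that arithmetic under those assumptions. The 23 fractional bits, the 100-hour uptime, and the function names are assumptions made for illustration, not details taken from the text above.

```python
from fractions import Fraction

# Assumption: 1/10 is stored with 23 bits after the binary point and the
# remaining bits are simply cut off (truncated, not rounded).
FRACTION_BITS = 23


def truncated_tenth(bits=FRACTION_BITS):
    """Return 1/10 truncated to `bits` fractional binary digits."""
    scale = 2 ** bits
    return Fraction(int(Fraction(1, 10) * scale), scale)


def clock_drift_seconds(hours_up, bits=FRACTION_BITS):
    """Accumulated error after `hours_up` hours of counting 0.1 s ticks."""
    ticks = hours_up * 3600 * 10  # number of tenth-of-a-second ticks
    error_per_tick = Fraction(1, 10) - truncated_tenth(bits)
    return float(ticks * error_per_tick)


if __name__ == "__main__":
    print(f"drift after 100 hours: {clock_drift_seconds(100):.4f} s")
```

Under these assumptions the drift after 100 hours is roughly a third of a second. That is a long time when tracking a target moving at well over a kilometer per second, which is why a small per-tick error matters in a real-time system.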
4. Year 2038 (security and reliability) Although Y2K has passed, we’re not out of the woods just yet. Not all computers handle dates in the same way, and many computers based on the UNIX operating system handle dates by counting how many seconds have elapsed since 01/01/1970. For example, the date 01/01/1980 is 315,532,800 seconds after 01/01/1970. This number is stored on these computers as a “signed 32-bit integer”, which has a maximum value of 2,147,483,647. That means it can only handle dates that are up to 2,147,483,647 seconds after 01/01/1970, which only takes us up to the 19th of January 2038, after which we may have problems again. This is especially true when we consider that UNIX-based software is more commonly used in “embedded systems” than in home PCs, that is, in systems that have a very specific purpose closely related to their hardware, such as robotic assembly lines, digital clocks, network routers, security systems and so on. Also, somebody is going to have to consider what we’re going to do on the 1st of January 10000. Not me though.
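The arithmetic on this slide can be checked directly. The sketch below assumes times are measured in UTC. The helper names (seconds_since_epoch, last_representable_moment, wrap_int32) are made up for this example, and wrap_int32 only imitates what happens when a signed 32-bit counter overflows and wraps around.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2 ** 31 - 1  # 2,147,483,647, the largest signed 32-bit value


def seconds_since_epoch(moment):
    """Seconds elapsed between 01/01/1970 (UTC) and `moment`."""
    return int((moment - EPOCH).total_seconds())


def last_representable_moment():
    """Latest time a signed 32-bit count of seconds can express."""
    return EPOCH + timedelta(seconds=INT32_MAX)


def wrap_int32(value):
    """Imitate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2 ** 31) % 2 ** 32 - 2 ** 31


if __name__ == "__main__":
    print(seconds_since_epoch(datetime(1980, 1, 1, tzinfo=timezone.utc)))
    print(last_representable_moment())
    # One second past the limit, a wrapping counter jumps far into the past.
    print(EPOCH + timedelta(seconds=wrap_int32(INT32_MAX + 1)))
```

The last line shows why the overflow is dangerous: a counter that wraps does not stop at the limit but becomes a large negative number, which a program would interpret as a date in December 1901.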
5. Loss of Communication between the FAA Air Traffic Control Center and Airplanes due to inadequate OS design (security and reliability) On Tuesday, September 14, 2004, Los Angeles International Airport and other airports in the region suspended operations due to a failure of the FAA radio system in Palmdale, California. Technicians onsite failed to perform the periodic maintenance check that must occur every 30 days, and the system shut down without warning as a result. The controllers lost contact with the planes when the main voice communications system shut down unexpectedly. Compounding this situation was the almost immediate crash of a backup system that was supposed to take over in such an event. The outage disrupted about 600 flights (including 150 cancellations), affecting over 30,000 passengers. Flights through the airspace controlled by the Palmdale facility were either grounded or rerouted elsewhere. Two airplane accidents almost occurred, and many lives were at risk. A bug in a Microsoft system, compounded by human error, was ultimately responsible for the three-hour radio breakdown. A Microsoft-based replacement for an older Unix system had to be reset at least once in roughly every 50 days of operation to prevent data overload. A technician failed to perform the reset at the right time, and an internal clock within the system subsequently shut it down. In addition, a backup system also failed. When a system has a known problem, it is never a good idea to continue operation. Instead of relying on an improvised workaround, the bug in the software should be corrected as soon as possible to avoid a potential crisis. Learning from this experience, the FAA deployed a software patch that addresses this issue.
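The slide does not explain why the limit was about 50 days. One commonly reported explanation, treated here as an assumption rather than something stated above, is a 32-bit counter of milliseconds that runs out after a fixed time. The sketch below checks how long such a counter lasts. The function names and the default tick rate are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_wrap(bits=32, ticks_per_second=1000):
    """Days until an unsigned counter of `bits` bits runs out of values.

    Assumption: the counter advances once per millisecond by default.
    """
    return 2 ** bits / (ticks_per_second * SECONDS_PER_DAY)


def restart_is_in_time(days_since_restart, bits=32, ticks_per_second=1000):
    """True if the system was restarted before its counter would wrap."""
    return days_since_restart < days_until_wrap(bits, ticks_per_second)


if __name__ == "__main__":
    print(f"32-bit millisecond counter lasts {days_until_wrap():.2f} days")
    print("30-day schedule safe:", restart_is_in_time(30))
```

Under this assumption a restart every 30 days leaves a comfortable margin, but a single missed restart is enough to reach the limit. That is the weakness of relying on a manual workaround instead of fixing the counter.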
6. Intel CPU causes OS bugs (reported on Jan. 3, 2018) (security and reliability) The exact bug is related to the way that regular apps and programs can discover the contents of protected kernel memory areas. Kernels in operating systems have complete control over the entire system, and connect applications to the processor, memory, and other hardware inside a computer. There appears to be a flaw in Intel’s processors that lets attackers bypass kernel access protections so that regular apps can read the contents of kernel memory. To protect against this, Linux programmers have been separating the kernel's memory away from user processes in what’s being called “Kernel Page Table Isolation.”