UNIX/Linux Server Support Engineering: Frequently Asked Questions : UNIX/Linux

Q:What is virtual memory?

A:Virtual memory is a memory management technique used by multitasking kernels and implemented by the operating system with hardware support. It virtualizes the hardware memory devices, such as RAM and hard disk (swap space), and gives each application its own virtual address space. Applications therefore work with virtual addresses instead of physical ones.

Q:What is memory pressure?

A:This term describes a state of the operating system in which most of the memory is in use. It does not mean that we are really out of memory. It only means that there is now an urgent need to start releasing or swapping out memory areas we do not currently need, because an application is requesting more memory than is available. An OS may have a policy (and Linux does) of making the most of its resources, especially scarce ones like main memory, by using as much memory as possible for buffers and kernel caches to speed things up. For example, we may keep in RAM one or more blocks read from disk even though a process needs only a few bytes of them. The process may need the rest shortly, and then we will not have to repeat an expensive I/O operation. Linux does exactly this: it uses otherwise free memory as a cache for disk access. Under memory pressure, Linux first tries to shrink these caches. Only when this is not enough does it fall back on paging inactive parts of process memory out to disk, because that is slower. Unlike some traditional Unix systems, Linux does not swap out entire processes at once; it pages out individual inactive pages.

There are two types of memory pressure a process can be exposed to: external and internal. To maximize its performance and reliability, a process might want to react to both. External memory pressure might drive a process, or the whole system, into heavy paging. Internal memory pressure might cause out-of-memory conditions and, eventually, a crash of the process.

External memory pressure is controlled by the OS, and it comes in two types: physical dynamic memory pressure and physical "static" memory pressure. The latter happens when the system runs out of page file (swap space), and it might drive the whole system into an out-of-memory condition. On Windows you might have seen the pop-ups in the corner of the screen warning that the system is running low on virtual memory. To detect this type of pressure, one needs to monitor the size and usage of the page file. Applications usually don't do this.

Q:Is there a command that would show the system to be in such a state?

A:The command "vmstat" (Virtual Memory STATistics) is helpful for getting memory usage details. Averages since last boot:
$ vmstat

The situation every second (press Ctrl-C to stop the output):
$ vmstat 1

The most interesting columns in the output are:
- for memory:
free: the amount of idle (free) memory

- for swap:
si: amount of memory swapped in from disk (per second), aka page-ins
so: amount of memory swapped out to disk (per second), aka page-outs

As a rule of thumb, if free memory is below 5% of total memory in more than 10 samples out of 100, we can say that the system is under "memory pressure". Adding more memory and/or spreading the workload more evenly over the day may be a cure here.
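The rule of thumb above can be sketched in Python. This is a minimal illustration, not a standard tool. The function names are hypothetical, and the sketch assumes the default vmstat layout: a title line, a column-header line, then one line of numbers per sample, with "free" in KiB. The 5% and 10-in-100 thresholds are the rule of thumb quoted above, not limits defined by the kernel.

```python
def parse_vmstat(output):
    """Turn vmstat text output into a list of {column name: value} dicts."""
    rows = [line.split() for line in output.strip().splitlines()]
    header = next(cols for cols in rows if "free" in cols)  # column-name line
    samples = []
    for cols in rows:
        # Data lines are all digits and have one value per column name.
        if len(cols) == len(header) and all(c.isdigit() for c in cols):
            samples.append(dict(zip(header, map(int, cols))))
    return samples


def mem_total_kib(path="/proc/meminfo"):
    """Read MemTotal (in KiB) from /proc/meminfo."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])
    raise ValueError("MemTotal not found")


def under_memory_pressure(free_kib_samples, total_kib,
                          threshold=0.05, max_low_per_100=10):
    """Rule of thumb: free < 5% of total in more than 10 of every 100 samples."""
    if not free_kib_samples:
        return False
    low = sum(1 for free in free_kib_samples if free < threshold * total_kib)
    # Scale the number of "low" samples to a rate per 100 samples.
    return low * 100 / len(free_kib_samples) > max_low_per_100
```

On a live system, the samples could come from the output of "vmstat 1 100". Skip the first data line, which is the average since boot, and take the total from mem_total_kib(). Linux deliberately fills spare RAM with caches, so it is worth combining a low "free" value with sustained non-zero si/so before drawing any conclusion.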

Q:What is a page fault?

A:A page fault is an exception raised by the hardware (the MMU) and handled by the kernel. It occurs when a program accesses a page that is mapped in its address space but not currently loaded in physical memory. A typical case is a page that has been written out to disk (we say "paged out"). That page has to be read back from disk (a "page-in" operation), and the event is called a major fault. If the kernel can provide the page without disk I/O, the event is called a minor fault. Examples are a freshly allocated page, or a page that is already in the page cache.
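Minor faults are easy to observe from a program. The sketch below is Python and Unix-only, and uses the standard resource and mmap modules. It allocates a fresh buffer, writes one byte per page and reports how many faults getrusage(2) counted. The 64 MiB size is arbitrary. The exact counts depend on the memory allocator and on whether the kernel backs the buffer with huge pages, so only the fact that they are non-zero is meaningful.

```python
import mmap
import resource

def fault_counts():
    """Return (minor, major) page faults incurred so far by this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt

def faults_from_touching(size_bytes):
    """Allocate a buffer, touch every page, and return the faults it caused."""
    minor0, major0 = fault_counts()
    buf = bytearray(size_bytes)
    # bytearray() zero-fills the buffer; writing one byte per page below
    # guarantees that every page has been touched even if it did not.
    for offset in range(0, size_bytes, mmap.PAGESIZE):
        buf[offset] = 1
    minor1, major1 = fault_counts()
    return minor1 - minor0, major1 - major0

if __name__ == "__main__":
    minor, major = faults_from_touching(64 * 1024 * 1024)
    print(f"minor faults: {minor}, major faults: {major}")
```

Major faults stay at or near zero here because no page has to be read from disk. They grow when the process accesses swapped-out pages or file pages that are not yet cached.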

Q:What is the OOM killer and when is it triggered?

A:The OOM killer (Out Of Memory killer) is related to memory overcommitment. When the system runs out of memory and cannot reclaim any more, the OOM killer starts killing processes to free some memory. A "badness" heuristic decides which process it kills; each process's current score is in /proc/<pid>/oom_score. The sysctl(8) parameter vm.overcommit_memory does not choose the victim. It sets the overcommit policy, i.e. whether processes may allocate more memory than is actually available. Major distribution kernels set /proc/sys/vm/overcommit_memory to zero by default. In this mode processes can request more memory than is currently free in the system, and only obviously excessive requests are refused.
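The current overcommit policy can be read from /proc. The minimal Python sketch below is Linux-only and uses hypothetical function names. The descriptions are a short summary of the three documented modes of vm.overcommit_memory. The sketch also extracts the CommitLimit and Committed_AS counters from the text of /proc/meminfo.

```python
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default): only obvious overcommits are refused",
    1: "always overcommit: allocations are never refused",
    2: "strict accounting: allocations beyond CommitLimit fail",
}

def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the current vm.overcommit_memory value."""
    with open(path) as f:
        return int(f.read().strip())

def describe_overcommit(mode):
    return OVERCOMMIT_MODES.get(mode, "unknown mode")

def commit_stats(meminfo_text):
    """Return (CommitLimit, Committed_AS) in KiB from /proc/meminfo text.

    CommitLimit is only enforced in mode 2; Committed_AS is the amount of
    memory that processes have been promised so far.
    """
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key] = int(rest.split()[0])
    return values["CommitLimit"], values["Committed_AS"]
```

On a live system: describe_overcommit(read_overcommit_mode()), and commit_stats() applied to the contents of /proc/meminfo.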

If a system is brought to its knees, intense paging is hurting performance and everything else fails, Linux will kill some processes to keep the machine up (OOM stands for Out Of Memory). It may seem terrible, but in such severe situations the alternative would be to panic or lock up the system, which is no better. The sacrifice frees up memory so that the system stays usable and can actually do something :) Linux tries to find the best candidate for elimination. These are the heuristic criteria it uses, quoted from comments in the kernel source (mm/oom_kill.c in older kernels):

1) we lose the minimum amount of work done
2) we recover a large amount of memory
3) we don’t kill anything innocent of eating tons of memory
4) we want to kill the minimum amount of processes (one)
5) we try to kill the process the user expects us to kill, this algorithm has been meticulously tuned to meet the principle of least surprise ... (be careful when you change it)

Q:Can the sysadmin force the OOM killer to start?

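A:Of course: by sending the Magic SysRq command "f", either with Alt+SysRq+f on the console or, as root, with "echo f > /proc/sysrq-trigger". It can be a better alternative to a hard reboot. It is usually preceded by SysRq "m", which dumps some kernel memory statistics. Alternatively, the sysadmin can trigger the OOM killer by actually creating an out-of-memory condition. The sysadmin can also configure the OOM killer's choice of victim by setting a per-process priority in /proc/<pid>/oom_score_adj. A value of -1000 exempts a process from the OOM killer, and +1000 makes it the first candidate.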
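These per-process settings live in /proc. Below is a small Python sketch, Linux-only, with hypothetical helper names. oom_score is the kernel's current badness score for the process, and oom_score_adj is the adjustment the sysadmin controls.

```python
def read_oom_info(pid="self"):
    """Return (oom_score, oom_score_adj) for a process (default: this one)."""
    with open(f"/proc/{pid}/oom_score") as f:
        score = int(f.read())
    with open(f"/proc/{pid}/oom_score_adj") as f:
        adj = int(f.read())
    return score, adj

def set_oom_score_adj(pid, value):
    """Bias the OOM killer's choice for a process.

    -1000 exempts the process, +1000 makes it the preferred victim.
    Going below the permitted minimum (normally 0) requires
    CAP_SYS_RESOURCE, i.e. usually root.
    """
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(value))
```

For example, as root, set_oom_score_adj(1234, -1000) would protect a critical daemon. Here 1234 is a placeholder PID.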

Q:Can the kernel swap to a ramdisk?

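A:Yes, it can. A RAM-backed block device, such as a ramdisk (/dev/ram0, from the brd driver) or a compressed zram device, can be formatted with mkswap and activated with swapon like any other swap device. The sysctl vm.swappiness is a separate matter. It is a tunable kernel parameter that controls how readily the kernel swaps out anonymous process memory rather than reclaiming page cache; higher values mean more swapping.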
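Once such a device is active, /proc/swaps lists it alongside any disk-backed swap, and /proc/sys/vm/swappiness holds the current swappiness. Below is a Python sketch with hypothetical helper names. The sample text in the test is illustrative, not real output from a particular machine.

```python
def parse_swaps(text):
    """Parse the contents of /proc/swaps; sizes are in KiB."""
    entries = []
    for line in text.splitlines()[1:]:          # skip the column-header line
        fields = line.split()
        if len(fields) != 5:
            continue
        name, kind, size, used, priority = fields
        entries.append({"name": name, "type": kind, "size_kib": int(size),
                        "used_kib": int(used), "priority": int(priority)})
    return entries

def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current value of vm.swappiness."""
    with open(path) as f:
        return int(f.read())
```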

Q:How can a 32-bit system use more than 4GB of RAM?

A:By installing a PAE (Physical Address Extension) enabled kernel, which makes it possible to use more than 4GB of RAM. An OS can do that through virtual memory, given some hardware support, because the 4GB limit only refers to the amount of memory *directly* addressable with 32 bits. A single application still uses 32-bit virtual addresses and is therefore still limited to 4GB. Several applications together can use more than 4GB, though, because the OS can map each 4GB virtual address space into a larger physical address space.

E.g. since 1995, starting with the Pentium Pro, Intel processors have included a feature called PAE (Physical Address Extension). PAE extends physical addresses from 32 to 36 bits, which allows 16*4GB = 64GB of memory to be addressed. That is the theoretical figure; in practice a bit less is usable because memory-mapped devices take up part of the address space. The Linux kernel supports PAE as a build option, and most Linux distributions already ship PAE-enabled kernels.
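The arithmetic and a check for the CPU flag fit in a short Python sketch. The helper names are hypothetical. On x86, the "pae" flag appears in the "flags" line of /proc/cpuinfo; other architectures format that file differently.

```python
GIB = 2 ** 30

def addressable_gib(address_bits):
    """Memory reachable with the given number of address bits, in GiB."""
    return 2 ** address_bits // GIB

def cpu_has_pae(cpuinfo_text):
    """True if the x86 'flags' line of /proc/cpuinfo lists 'pae'."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "pae" in line.split(":", 1)[1].split()
    return False

print(addressable_gib(32))   # 4
print(addressable_gib(36))   # 64, i.e. 16 * 4 GiB
```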

Q:How does one list the pci peripherals connected to the system?

A:The lspci(8) command lists the PCI devices connected to the system. I often use:
$ lspci | grep -i net

to identify the wireless interface,
$ lspci | grep VGA
for the graphics card, etc.
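lspci gets its information from the kernel, which also exposes it under /sys/bus/pci/devices. The minimal Python sketch below uses hypothetical helper names. It lists the devices and classifies them by the base-class byte of their PCI class code: 0x02 for network and 0x03 for display controllers. That is roughly what "lspci | grep -i net" and "lspci | grep VGA" pick out by name.

```python
import os

PCI_BASE_CLASSES = {
    0x01: "mass storage", 0x02: "network", 0x03: "display",
    0x04: "multimedia", 0x06: "bridge", 0x0C: "serial bus (e.g. USB)",
}

def pci_kind(class_code):
    """Map a 24-bit PCI class code (e.g. 0x020000) to a coarse category."""
    return PCI_BASE_CLASSES.get(class_code >> 16, "other")

def _read(path):
    with open(path) as f:
        return f.read().strip()

def pci_devices(root="/sys/bus/pci/devices"):
    """Return (address, vendor id, device id, kind) for every PCI device."""
    if not os.path.isdir(root):
        return []        # no PCI bus visible (e.g. some VMs or containers)
    devices = []
    for addr in sorted(os.listdir(root)):
        base = os.path.join(root, addr)
        class_code = int(_read(os.path.join(base, "class")), 16)
        devices.append((addr, _read(os.path.join(base, "vendor")),
                        _read(os.path.join(base, "device")), pci_kind(class_code)))
    return devices
```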

Some other methods can be used as well, for example:
"lshal" (from HAL, the hardware abstraction layer) is one of the better ways to list all the devices and peripherals connected to the system. Alternatively, "lsdev" can be used for the same purpose. There are several other methods/commands as well.
There are also:
$ cat /proc/cpuinfo
to find CPU specification
$ cat /proc/meminfo
to get memory information
$ lsmod
to list the loaded kernel modules
$ dmesg
to examine the kernel ring buffer

If one has root privileges:
# dmidecode
to dump BIOS info and overall hardware in a system
# hdparm -I /dev/sda (repeat for sdb, sdc, etc)
detailed IDE/SATA HDD info

Q:What is a filesystem and what are the main structures used in it?

A:A file system is a structure used to organize and access files on a storage device; the most popular storage device is the hard disk. It is a kind of database that stores files (both file metadata and data), usually in secondary memory and organizes them in a hierarchical structure of nested directories for easy retrieval by humans.

The basic purpose of a file system is to organize files and provide access to those files. The primary organization method is called the hierarchical structure. The entry point to this structure is called the root or top-level. The two basic elements of the structure are folders and files. Folders are used to hold more files and folders. By starting at the root folder and opening a folder in the root, you are moving one level deeper into the hierarchy.
The main structures are:
- files
- directories: in Unix a directory is just a special kind of file containing a list of file names and their inode numbers
- metadata structures: in Unix these are called inodes, and they store all the information about a file, directory or other filesystem object except its data and its name
- data storage structures: called clusters or blocks, usually made up of a fixed number of disk sectors
- indexes, which serve two purposes. One is quick random access to an arbitrary block of a file, ranging from simple linked lists, as in FAT, to the indirect block maps of ext3. The other is quick translation between names and inodes: ext3 can index large directories with hashed B-trees, and Linux also caches recent lookups in its dentry cache.
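A directory really is a list of (name, inode number) pairs, and everything else about a file lives in its inode. Below is a small Python illustration using os.scandir() and os.stat(). The helper names are hypothetical, and demo() works in a throwaway temporary directory.

```python
import os
import stat
import tempfile

def directory_entries(path):
    """Return the (name, inode number) pairs stored in a directory."""
    with os.scandir(path) as it:
        return sorted((entry.name, entry.inode()) for entry in it)

def inode_metadata(path):
    """Some of what the inode holds: everything except name and data."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,
        "permissions": stat.filemode(st.st_mode),  # includes the file type
        "links": st.st_nlink,
        "owner uid": st.st_uid,
        "size": st.st_size,
        "modified": st.st_mtime,
    }

def demo():
    """Create a file plus a hard link to it, then list the directory."""
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "a.txt"), "w") as f:
            f.write("hello\n")
        os.link(os.path.join(d, "a.txt"), os.path.join(d, "b.txt"))
        return directory_entries(d), inode_metadata(os.path.join(d, "a.txt"))
```

In the result, "a.txt" and "b.txt" are two directory entries pointing at one and the same inode.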

Q:Do all filesystems implement all the file and directory access system calls? When making a directory for instance, do filesystems share the same code in the syscall?

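A:The system call is the fundamental interface between an application and the kernel, so the syscall code is indeed shared. Every filesystem is reached through the same system calls: mkdir(2) enters the kernel through the same code whatever filesystem the new directory will live on. The filesystem-specific part of the work is not shared. The kernel delegates it to each filesystem's own implementation, and not every filesystem implements every operation, as explained below.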

Linux supports a very wide range of filesystems, old and new. These include some more exotic ones, e.g. for clustering or encryption, and virtual filesystems such as /proc. One can even create one's own filesystem entirely in user space (see the FUSE project).

Of course it wouldn't be possible to support so many types of filesystems and target storage devices transparently to applications without some sort of abstraction and layered architecture. The Virtual File System (VFS) in the Linux kernel defines a common set of operations that filesystems implement. As a result, the mkdir(2) system call, for example, doesn't have to be aware of the filesystem type or of the storage medium on which the filesystem is mounted. User applications deal with a consistent, generic system call interface instead of worrying about differences between individual filesystem implementations. A filesystem doesn't have to implement every operation behind the calls offered by the GNU C Library, though. Say that I develop an awkward filesystem with no concept of directories: calling mkdir(2) on it will return the EPERM error code.

Q:Is it possible to have a hard link spanning two files in different filesystems and what is the reason?

A:The answer is definitely no: hard links cannot cross filesystem boundaries. Hard links are based on inode numbers and nothing else. As "ls -i" shows, two or more hard links to the same file share the same inode number. Inode numbers are only guaranteed to be unique within a single filesystem, so an inode number is meaningless outside its own filesystem. If Linux allowed a hard link into a different filesystem, it would be ambiguous which filesystem the inode number belongs to, so link(2) fails with EXDEV instead. A symbolic link, on the other hand, stores a path name rather than an inode number, so it can point to a file on another filesystem.
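The rule can be sketched in Python with hypothetical helper names. same_filesystem() compares st_dev, the ID of the filesystem a file lives on. link_or_symlink() falls back to a symbolic link when link(2) fails with EXDEV, which is exactly the cross-filesystem case.

```python
import errno
import os
import tempfile

def same_filesystem(path_a, path_b):
    """Hard links are only possible when both paths have the same st_dev."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

def link_or_symlink(src, dst):
    """Try a hard link; fall back to a symlink if src is on another filesystem."""
    try:
        os.link(src, dst)
        return "hard link"
    except OSError as e:
        if e.errno != errno.EXDEV:   # EXDEV: "Invalid cross-device link"
            raise
    # A symlink stores a path, not an inode number, so it may cross filesystems.
    os.symlink(os.path.abspath(src), dst)
    return "symbolic link"

def example():
    """Link a file inside one temporary directory (same filesystem)."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "data")
        with open(src, "w"):
            pass
        kind = link_or_symlink(src, os.path.join(d, "data.link"))
        return kind, os.stat(src).st_nlink
```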

Q:Can we use 8k block ext3 filesystems on x86 machines?

A:Of course not. On x86 the memory page size is 4KiB, and Linux cannot mount an ext3 filesystem whose block size is larger than the page size. A usable filesystem block there is therefore at most 4KiB, which is also the usual default. 8KiB blocks can be used on architectures with larger pages, such as Intel Itanium and other architectures that support 8KiB pages.
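Python can compare the two sizes with os.sysconf() and os.statvfs(). The sketch below uses hypothetical helper names. f_bsize is the block size statvfs reports; for ext2/ext3 it is the filesystem block size, which "tune2fs -l" on the device also prints as "Block size".

```python
import os

def page_size():
    """Memory page size of this machine, in bytes."""
    return os.sysconf("SC_PAGE_SIZE")

def fs_block_size(path="/"):
    """Block size reported by statvfs(3) for the filesystem holding path."""
    return os.statvfs(path).f_bsize

def block_size_mountable(block_size, page=None):
    """Linux ext2/ext3 rule: the block size must not exceed the page size."""
    if page is None:
        page = page_size()
    return block_size <= page
```

On a typical x86 machine, block_size_mountable(8192) returns False.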

Q:What are the reasons why some applications insist on accessing raw devices?

A:Complex applications such as database management systems (DBMS) usually do their own caching, tailored to their needs better than the kernel's generic cache. So they use raw devices, which perform I/O on existing block devices while bypassing the caching normally associated with block devices (the kernel's buffer cache). If such applications are well engineered, they can perform better on a raw device than on a filesystem, e.g. with better throughput. The drawback is that the OS cannot manage raw disk partitions (e.g. with Unix shell commands). Only the application knows their format; to the OS they are just a bunch of contiguous disk blocks without any known structure. Another reason could be that an application wants to implement its own kind of network or cluster filesystem, so that many hosts can access the same data concurrently.

Q:What type of file system would be most suitable for a character device?

A:A raw one, i.e. no filesystem at all. Character devices are read and written directly, without buffering, which is exactly why one usually wants raw access.
Block devices, on the contrary, are read and written through the buffer cache, in fixed block sizes or multiples of the block size. Linux sees a raw device as a character device, but it is actually bound to an existing block device (usually a disk); historically the raw(8) command created this binding.
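The difference shows in a file's mode bits, as the "c" or "b" that ls -l prints in the first column. Below is a short Python check with a hypothetical helper name, using the standard stat module.

```python
import os
import stat

def device_kind(path):
    """Tell character devices, block devices and everything else apart."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "character device"
    elif stat.S_ISBLK(st.st_mode):
        kind = "block device"
    else:
        return "not a device", None
    # st_rdev identifies the device itself (major = driver, minor = instance).
    return kind, (os.major(st.st_rdev), os.minor(st.st_rdev))
```

For example, device_kind("/dev/null") reports a character device with major 1 and minor 3 on Linux.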

Q:What is an access control list? How is it used in Linux?

A:An access control list (ACL) is a list attached to an object that defines its access policy. It specifies which access rights each user or group has to a particular system object, such as a directory or an individual file. The most common privileges are the ability to read, write and execute (rwx) files and directories. In short, it is a list of filesystem permission entries. Each file or directory can have an associated ACL listing the permission rules that apply to it. Each rule within an ACL is called an access control entry, or ACE. In general, an ACE identifies the user or group to which it applies and specifies a set of permissions for them. ACLs have no fixed length and can include permission specifications for multiple users or groups.

Traditionally, Unix assigns permissions (rwx plus the special SUID, SGID and sticky bits) to only three classes of users: the file owner, the owning group and all other users. This is sufficient in most situations. More complex permission models, however, may require assigning permissions to individual users or groups that are neither the owner nor the owning group. POSIX ACLs provide such fine-grained control, and they are well supported by, for example, Linux 2.6 and the ext3 filesystem.

Here is an example, run on my tablet:
$ sudo -s
# cd /tmp
# dd if=/dev/zero of=partition.img bs=1k count=10000
# losetup /dev/loop0 partition.img
# mkfs.ext3 -c /dev/loop0 10000
# mkdir partition
# mount -t ext3 -o acl /dev/loop0 partition
# cd partition
# cat > data.csv
I am Debian fan and Free Software promoter
^D
# chmod go= data.csv
# getfacl data.csv
# file: data.csv
# owner: root
# group: root
user::rw-
group::---
other::---
# setfacl -m u:myfriend:r-- data.csv
# getfacl --omit-header data.csv
user::rw-
user:myfriend:r--
group::---
mask::r--
other::---
# cd ..
# umount partition
# losetup -d /dev/loop0
^D
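The mask line in the output above caps the permissions of named users, named groups and the owning group, but not those of the owner or of others. The Python sketch below, with hypothetical helper names, parses getfacl output like the one above and computes effective permissions that way. It simply skips default ACL lines.

```python
def parse_getfacl(text):
    """Parse getfacl output into {(tag, qualifier): "rwx"-style perms}."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop "# file:" lines, "#effective:"
        if not line or line.startswith("default:"):
            continue                           # default ACLs are ignored here
        tag, qualifier, perms = line.split(":")
        entries[(tag, qualifier)] = perms
    return entries

def effective_perms(entries, tag, qualifier=""):
    """Apply the mask entry the way POSIX ACLs do."""
    perms = entries[(tag, qualifier)]
    mask = entries.get(("mask", ""))
    # The mask limits named users, named groups and the owning group,
    # but never the owner (user::) or others (other::).
    if mask and (qualifier or tag == "group"):
        perms = "".join(p if m != "-" else "-" for p, m in zip(perms, mask))
    return perms
```

For the transcript above, effective_perms(acl, "user", "myfriend") gives "r--", so myfriend can read data.csv but not modify it.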

Q:What is the main binary executable format used in linux and the most important components of this format?

A:The standard ELF (Executable and Linkable Format). It is used for regular executables, shared libraries, object code (.o files) and core dumps (useful e.g. for post-mortem debugging), because their internal structure is quite similar. An ELF file starts with an ELF header, followed by a program header table describing the segments to be loaded into memory, and then the actual data. The data is divided into sections, a concept familiar to any assembly language programmer, and a section header table describes them.

section   type of information
.text     executable code
.data     initialized data (variables)
.bss      uninitialized data

I remember the old MS-DOS COM format, which jumbled everything together with no structure. At that time there was no memory protection, and a program's executable code could modify itself. That "feature" was not very useful in common programming and was exploited mostly by mutant viruses :) Today the Linux kernel loads .text sections into memory pages marked read-only (and executable), while .data sections use read-write pages. .bss sections take up no space in the executable file beyond their description. They only declare variables, so that the loader knows how much zero-filled memory to reserve for them. Each ELF file typically also includes a symbol table with important data for linking, relocation and debugging. The debugging information is optional and can be stripped out to reduce the executable size.

For example, I can use a function defined not in my code but in an external shared library, a very common situation. Say I use printf(3) from the standard C I/O library to output a string. When I compile my code, I cannot yet know the final address to use with the assembly-level CALL instruction. I do not even know the absolute addresses of the operand(s) that must be pushed onto the stack before the call is made, assuming the parameters are not all constants. But I can set aside all the information needed to compute these addresses later: the type of relocation, which symbol is being referenced, the relative addresses of operands in one of the .data or .bss sections, etc. At load time this information is used to resolve all the symbols and produce a runnable image in memory. On Linux this is done by the dynamic linker (ld-linux.so), which the kernel starts together with the program. In summary, the machine code in an executable is not in a ready-to-run state: some references can only be resolved when the program is loaded into memory.
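readelf -S shows a binary's section layout. To illustrate what such a tool reads, here is a minimal Python parser of the ELF header and section header table that uses only the struct module. It assumes a well-formed file and ignores the rare extended-numbering cases used for very large section counts. The function names are hypothetical. For each section it returns the name, type, flags and size. SHT_NOBITS marks sections such as .bss that occupy memory but no space in the file.

```python
import struct
import sys

SHT_NOBITS = 8       # section occupies no space in the file (e.g. .bss)
SHF_WRITE = 0x1      # writable at run time
SHF_ALLOC = 0x2      # occupies memory while the program runs
SHF_EXECINSTR = 0x4  # contains executable machine code

def elf_sections(path):
    """Return (name, type, flags, size) for each section of an ELF file."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:4] != b"\x7fELF":
        raise ValueError(f"{path} is not an ELF file")
    end = "<" if data[5] == 1 else ">"        # EI_DATA: 1 = little endian
    if data[4] == 2:                          # EI_CLASS: 2 = 64-bit
        hdr = struct.unpack_from(end + "HHIQQQIHHHHHH", data, 16)
        sh_fmt = end + "IIQQQQIIQQ"
    else:                                     # EI_CLASS: 1 = 32-bit
        hdr = struct.unpack_from(end + "HHIIIIIHHHHHH", data, 16)
        sh_fmt = end + "IIIIIIIIII"
    shoff, shentsize, shnum, shstrndx = hdr[5], hdr[10], hdr[11], hdr[12]
    # Each header: name offset, type, flags, address, file offset, size, ...
    headers = [struct.unpack_from(sh_fmt, data, shoff + i * shentsize)
               for i in range(shnum)]
    if not headers:
        return []
    names_at = headers[shstrndx][4]   # file offset of the section-name table
    sections = []
    for sh in headers:
        start = names_at + sh[0]
        name = data[start:data.index(b"\0", start)].decode()
        sections.append((name, sh[1], sh[2], sh[5]))
    return sections

def flag_letters(flags):
    """Render flags the way readelf -S does: W(rite), A(lloc), X(ecute)."""
    return "".join(letter for bit, letter in
                   ((SHF_WRITE, "W"), (SHF_ALLOC, "A"), (SHF_EXECINSTR, "X"))
                   if flags & bit)

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else sys.executable
    for name, sh_type, flags, size in elf_sections(path):
        if name in (".text", ".data", ".bss"):
            note = " (no file space)" if sh_type == SHT_NOBITS else ""
            print(f"{name:6} {flag_letters(flags):3} {size} bytes{note}")
```

On a typical Linux binary, .text is reported as "AX" (loaded, executable, not writable) and .data as "WA", which matches the read-only and read-write pages described above.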

Q:What is linking?

A:Linking is the process of combining various pieces of code and data into a single executable that can be loaded into memory. It can happen at three points. At compile time it is done by linkers, e.g. GNU ld. At load time it is done by loaders; on Linux the kernel maps the program and the dynamic linker ld-linux.so finishes the job in user space. At run time it is done by application programs themselves, e.g. with dlopen(3). In the early days of computing it was done manually :)

There is considerable overlap between the functions of linkers and loaders. One way to think of them is this. The loader loads the program from disk into RAM, allocates storage space and maps the file's pages into the process's virtual address space. The linker does all the symbol resolution that can be done at compile time. Either of them can do relocation and merge all sections of the same type.

Q:What is a shared library?

A:A shared library is a library that lets executables access external functionality dynamically at run time. This reduces their overall memory footprint, because the functionality is brought in only when it's needed. Linux has two kinds of libraries: static and shared. Shared libraries can be used in two different ways: dynamic linking, resolved automatically when the program is loaded, and dynamic loading, where the program itself loads the library at run time, e.g. with dlopen(3).

Shared libraries are an improvement over static libraries. Static libraries allow code reuse, but the linker has to extract all the library functions used and make them part of the executable, which can make it rather large. By contrast, when you link your program against a shared library, no code is copied from the library into your binary; only information for the loader is saved. The dynamic loader adds the shared-library code your program uses to its address space at run time, rather than at compile time as with static libraries. As the name implies, multiple programs actually share these libraries, and they can easily be upgraded centrally.

Linking, relocation, static/shared libraries and shared code segments are all a complication, but necessary to support code reuse, keep executable size low and save computational resources.
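Dynamic loading is easy to try from Python, whose standard ctypes module uses dlopen(3) and dlsym(3) under the hood on Linux. The sketch below assumes a system where ctypes.util.find_library("c") can find the C library. The helper name is hypothetical.

```python
import ctypes
import ctypes.util

def load_c_library():
    """dlopen() the C library at run time, the way a plugin loader would."""
    name = ctypes.util.find_library("c")   # e.g. "libc.so.6" with glibc
    # If no name is found, CDLL(None) falls back to the symbols already
    # loaded into this process.
    return ctypes.CDLL(name)

libc = load_c_library()
# Declare the C prototype: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"shared library"))   # 14
```

No library code was copied into the Python program. The symbol "strlen" is looked up in the already-shared libc pages only when the script asks for it.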

Q:How can one see all the shared libraries which a binary is linked to?

A:When we start an application, the kernel loads the program's ELF image into the process's user-space virtual memory. The kernel then notices an ELF section called ".interp" (referenced by the PT_INTERP program header). It names the dynamic linker to be used, e.g. /lib/ld-linux.so.2 on 32-bit x86. The dynamic linker is itself a shared library, but it is statically linked, has no shared library dependencies of its own, and loads the libraries the program needs. To list all the shared libraries a binary is linked against, we can run "ldd /path/of/binary". It lists the dynamic dependencies of executable files or shared objects, e.g.:
$ ldd -v /usr/bin/vi
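ldd's output is easy to post-process. Below is a Python sketch with hypothetical helper names. It parses plain (non -v) ldd output into a name-to-path mapping, with None for libraries that were "not found". The ldd(1) manual warns that ldd may execute the program it inspects. For untrusted binaries, use "objdump -p binary | grep NEEDED" instead.

```python
import subprocess

def parse_ldd(output):
    """Map each library name in ldd output to its resolved path (None if missing)."""
    deps = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or line.endswith(":"):      # blank lines, "file:" headers
            continue
        name, sep, target = line.partition("=>")
        if not sep:
            # No "=>": the dynamic linker itself, or linux-vdso.so.1,
            # which the kernel provides and which has no file.
            target = name
        name = name.split(" (", 1)[0].strip()
        path = target.split(" (", 1)[0].strip()  # drop "(0x...)" load address
        deps[name] = None if path == "not found" else path
    return deps

def ldd_dependencies(binary):
    """Run ldd on a trusted binary and parse its output."""
    result = subprocess.run(["ldd", binary], capture_output=True,
                            text=True, check=True)
    return parse_ldd(result.stdout)
```

Libraries mapped to None are the ones that make a program fail to start. That is the symptom in the next question.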

Q:Accidentally removing all the links in the /lib directory, while leaving the library files themselves in place, causes many applications to fail to load. What is the best way of finding the links through which a library should be accessed, and how can one recover from this situation?

A:Inspecting another similar, working system and recreating the links by hand may help the budding sysadmin here, although it is not the fastest way :) (see below). /lib holds only the libraries needed to boot the system and to run the commands in the root filesystem, so it's not a long list to go through. To keep the root partition small, most libraries are put in /usr/lib.

Anyway, the run-time loader finds a library by its "soname", which includes only the major version number (for example, "libfoo.so.1"). Therefore a new version of the library with the same major number can be installed, and existing programs will use it automatically. Of course, it is critical to change the major version number if calling sequences change in an incompatible way. Several libraries with different major version numbers can be installed at once. In fact they need to be, until all programs using the library have been recompiled.

There should also be a symbolic link with no version number (for example, "libfoo.so") which is used at compile time to find the current version.

ldconfig(8) creates, for each shared library it finds, a symbolic link named after the soname that points to the current version of the library. It also rebuilds the /etc/ld.so.cache lookup cache. So, to recover, it should suffice to run:
# ldconfig /lib
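To show the idea behind that command, here is a simplified Python sketch of recreating soname links. It is not how ldconfig works internally. The real tool reads each library's soname from the DT_SONAME field inside the ELF file. This sketch assumes the file names follow the libNAME.so.MAJOR.MINOR... convention, so it cannot handle names like libc-2.13.so whose soname is libc.so.6. The function names are hypothetical. The sketch only replaces existing symlinks, never regular files, and is best tried on a scratch directory rather than on /lib.

```python
import os
import re
import tempfile

# libNAME.so.MAJOR[.MINOR[.PATCH...]]
LIB_RE = re.compile(r"^(lib.+\.so)\.(\d+)((?:\.\d+)*)$")

def version_key(filename):
    """(1, 2, 10) for libfoo.so.1.2.10, so versions compare numerically."""
    m = LIB_RE.match(filename)
    return tuple(int(part) for part in (m.group(2) + m.group(3)).split("."))

def relink_sonames(libdir):
    """Point libNAME.so.MAJOR at the newest libNAME.so.MAJOR.x.y in libdir."""
    newest = {}
    for name in os.listdir(libdir):
        m = LIB_RE.match(name)
        path = os.path.join(libdir, name)
        if not m or os.path.islink(path) or not os.path.isfile(path):
            continue
        soname = m.group(1) + "." + m.group(2)   # keep only the major version
        if soname == name:
            continue
        if soname not in newest or version_key(name) > version_key(newest[soname]):
            newest[soname] = name
    for soname, target in newest.items():
        link = os.path.join(libdir, soname)
        if os.path.islink(link):
            os.remove(link)          # replace a stale or dangling symlink
        elif os.path.exists(link):
            continue                 # never overwrite a real file
        os.symlink(target, link)     # relative target, as ldconfig creates
    return newest

def example():
    """Run relink_sonames() on empty dummy files in a temporary directory."""
    with tempfile.TemporaryDirectory() as d:
        for name in ("libfoo.so.1.2.3", "libfoo.so.1.2.10", "libbar.so.2.0"):
            with open(os.path.join(d, name), "w"):
                pass
        relink_sonames(d)
        return {s: os.readlink(os.path.join(d, s))
                for s in ("libfoo.so.1", "libbar.so.2")}
```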

Q:Is it possible to override the functions defined in a shared library when running an application?

A:Yes. Linux shared libraries are flexible enough to let us override specific library functions when running a particular program. This requires neither touching the library source code nor root permissions to install a patched version of the library! It is useful, e.g., for debugging purposes or transparent extensions.
This is done with the LD_PRELOAD environment variable, a feature of the dynamic linker ld.so(8) rather than of the GNU linker ld(1). It is well explained here:
http://www.ibm.com/developerworks/linux/library/l-glibc.html

Q:What kind of memory protection is needed in order for the operating system to correctly implement shared libraries?

A:Shared libraries place library functions into a single unit that multiple processes can share at run time, which saves both disk space and RAM. For this to work safely, the operating system maps the library's code pages read-only and executable (PROT_READ | PROT_EXEC). No process can then modify the code the others are running, and all of them can share the same physical pages. The library's writable data is mapped private and copy-on-write (MAP_PRIVATE), so a process that writes to it gets its own copy of the affected pages.
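On Linux these protections are visible in /proc/<pid>/maps, where each library typically appears several times. An "r-xp" mapping holds the code: readable, executable and private, but never written, so its pages stay shared. "rw-p" mappings hold the data, which is copy-on-write. The Python sketch below uses hypothetical helper names and prints the shared-library mappings of the current process. The exact lines vary from system to system, so no output is shown.

```python
import os

def mappings(pid="self"):
    """Yield (permissions, pathname) for each memory mapping of a process."""
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            # Fields: address perms offset dev inode [pathname]
            fields = line.split(maxsplit=5)
            path = fields[5].strip() if len(fields) == 6 else ""
            yield fields[1], path

def library_mappings(pid="self"):
    """Mappings whose file name looks like a shared library (*.so*)."""
    return [(perms, path) for perms, path in mappings(pid)
            if ".so" in os.path.basename(path)]

for perms, path in library_mappings():
    print(perms, path)
```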

Q:What is the meaning of an unfinished syscall?

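A:Suppose a system call is being executed when a different traced thread or process (e.g. with strace -f) makes another call. strace tries to preserve the order of these events. It prints the ongoing call marked as "<unfinished ...>" and later completes it on a separate "<... NAME resumed>" line. A call also stays unfinished in the output when it simply never returns, as in the sync() case below.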

Q:How can one tell exactly where the process is stuck, or how to debug the problem further?

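A:A process is said to be "stuck" when it cannot proceed because it is waiting for an event that cannot, or does not, occur. To find out where it is stuck, we can run it under a debugger such as gdb and set breakpoints. It may be better to start the program from the debugger instead of trying to attach to it at run time. For a process that is already stuck, "ps -o pid,stat,wchan -p PID" shows its state and the kernel function it is sleeping in. "strace -p PID" shows the system call it is blocked in.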
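The information ps displays can also be read directly from /proc. Below is a Python sketch with hypothetical helper names. process_state() extracts the state letter from /proc/<pid>/stat; a process stuck inside the kernel shows "D" there. wait_channel() reads /proc/<pid>/wchan, the kernel function the process is waiting in.

```python
STATE_NAMES = {
    "R": "running", "S": "sleeping (interruptible)",
    "D": "uninterruptible sleep (usually I/O)", "T": "stopped",
    "t": "stopped by a debugger", "Z": "zombie", "I": "idle kernel thread",
}

def process_state(pid="self"):
    """Return (state letter, description) from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # Field 2 is the command name in parentheses, and it may contain spaces
    # or ")" itself, so split after the last ")".
    state = stat.rsplit(")", 1)[1].split()[0]
    return state, STATE_NAMES.get(state, "other")

def wait_channel(pid="self"):
    """Kernel function the process is sleeping in ("0" when not sleeping)."""
    try:
        with open(f"/proc/{pid}/wchan") as f:
            return f.read().strip()
    except OSError:
        return None   # not readable on this kernel or configuration
```

A process that inspects itself is necessarily running, so process_state() without arguments reports "R". For the stuck sync in the next question, passing its PID would report "D".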

Q:Superuser runs 'sync' on a linux system, but this command never returns, doing 'ps auxw | grep sync' the sysadmin notices that it is in 'D' state. Can the sysadmin kill this process? The sysadmin tried to strace the process, which only showed the unfinished sync() syscall.

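A:A process in state D is in uninterruptible sleep and basically cannot be killed by users or admins. Uninterruptible means that the process is in the middle of a critical operation inside the kernel, typically waiting for I/O; here it is waiting for dirty data to be written to disk. It may also be holding a semaphore or another critical system resource. Signals, even SIGKILL, neither stop it nor alter its behavior until the operation completes. If the I/O never completes, for example because of a failing disk or an unreachable NFS server, rebooting the machine is in practice the only way to get rid of the process.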

Q:The sysadmin rebooted the system, and now the boot loader is not working properly, and GRUB complains about a problem at stage 1.5. What should one do?

A:An error at stage 1.5 is one of the most common GRUB problems. GRUB can no longer find or load the files it needs, typically because its configuration or installation is broken, and there are several ways to restore it. One solution is to reinstall GRUB into the MBR, which does not damage the installed system. Boot from a live CD or USB pen drive, mount the filesystem where GRUB is located, chroot into it and reinstall the boot loader (e.g. with grub-install).

Saturday, July 30, 2011
