Age | Commit message (Collapse) | Author |
|
This filesystem is based on the code of the long-lived TmpFS. It differs
from that filesystem in one keypoint - its root inode doesn't have a
sticky bit on it.
Therefore, we mount it on /dev, to ensure only root can modify files on
that directory. In addition to that, /tmp is mounted directly in the
SystemServer main (start) code, so it's no longer specified in the fstab
file. We ensure that /tmp has a sticky bit and has the value 0777 for
root directory permissions, which is certainly a special case when using
RAM-backed (and in general other) filesystems.
Because of these 2 changes, it's no longer needed to maintain the TmpFS
filesystem, hence it's removed (renamed to RAMFS), because the RAMFS
represents the purpose of this filesystem in a much better way - it
relies on being backed by RAM "storage", and therefore it's easy to
conclude it's temporary and volatile, so its content is gone on either
system shutdown or unmounting of the filesystem.
|
|
When doing PT_SETREGS, we want to verify that the debugged thread is
executing in usermode.
b2f7ccf refactored things and flipped the relevant check around, which
broke things that use PT_SETREGS (for example, stepping over
breakpoints with sdb).
|
|
|
|
Using this abstraction it is possible to compile this file for aarch64.
|
|
This was using the x86_64 specific cpu_flags abstraction, which is not
compatible with aarch64.
|
|
|
|
|
|
|
|
The handling of page tables is very architecture specific, so belongs
in the Arch directory. Some parts were already architecture-specific,
however this commit moves the rest of the PageDirectory class into the
Arch directory.
While we're here the aarch64/PageDirectory.{h,cpp} files are updated to
be aarch64 specific, by renaming some members and removing x86_64
specific code.
|
|
Various places in the kernel were manually checking the cs register for
x86_64, however to share this with aarch64 a function in RegisterState
is added, and the call-sites are updated. While we're here the
PreviousMode enum is renamed to ExecutionMode.
|
|
This header has always been fundamentally a Kernel API file. Move it
where it belongs. Include it directly in Kernel files, and make
Userland applications include it via sys/ioctl.h rather than directly.
|
|
Reduce inclusion of limits.h as much as possible at the same time.
This does mean that kmalloc.h is now including Kernel/API/POSIX/limits.h
instead of LibC/limits.h, but the scope could be limited a lot more.
Basically every file in the kernel includes kmalloc.h, and needs the
limits.h include for PAGE_SIZE.
|
|
I wrote that file in 2022, not Andreas in 2018.
|
|
|
|
Before this patch, Core::SessionManagement::parse_path_with_sid() would
figure out the root session ID by sifting through /sys/kernel/processes.
That file can take quite a while to generate (sometimes up to 40ms on my
machine, which is a problem on its own!) and with no caching, many of
our programs were effectively doing this multiple times on startup when
unveiling something in /tmp/session/%sid/
While we should find ways to make generating /sys/kernel/processes fast
again, this patch addresses the specific problem by introducing a new
syscall: sys$get_root_session_id(). This extracts the root session ID
by looking directly at the process table and takes <1ms instead of 40ms.
This cuts WebContent process startup time by ~100ms on my machine. :^)
|
|
We really don't want callers of this function to accidentally change
the jail, or even worse - remove the Process from an attached jail.
To ensure this never happens, we can just declare this method as const
so nobody can mutate it this way.
|
|
Allow sending `SIGCONT` to processes that share the same `pgid`.
This is allowed in Linux aswell.
Also fixes a FIXME :^)
|
|
There are places in the kernel that would like to have access
to `pgid` credentials in certain circumstances.
I haven't found any use cases for `sid` yet, but `sid` and `pgid` are
both changed with `sys$setpgid`, so it seemed sensical to add it.
In Linux, `man 7 credentials` also mentions both the session id and
process group id, so this isn't unprecedented.
|
|
This step would ideally not have been necessary (increases amount of
refactoring and templates necessary, which in turn increases build
times), but it gives us a couple of nice properties:
- SpinlockProtected inside Singleton (a very common combination) can now
obtain any lock rank just via the template parameter. It was not
previously possible to do this with SingletonInstanceCreator magic.
- SpinlockProtected's lock rank is now mandatory; this is the majority
of cases and allows us to see where we're still missing proper ranks.
- The type already informs us what lock rank a lock has, which aids code
readability and (possibly, if gdb cooperates) lock mismatch debugging.
- The rank of a lock can no longer be dynamic, which is not something we
wanted in the first place (or made use of). Locks randomly changing
their rank sounds like a disaster waiting to happen.
- In some places, we might be able to statically check that locks are
taken in the right order (with the right lock rank checking
implementation) as rank information is fully statically known.
This refactoring even more exposes the fact that Mutex has no lock rank
capabilites, which is not fixed here.
|
|
Check if the process we are currently running is in a jail, and if that
is the case, fail early with the EPERM error code.
Also, as Brian noted, we should also disallow attaching to a jail in
case of already running within a setid executable, as this leaves the
user with false thinking of being secure (because you can't exec new
setid binaries), but the current program is still marked setid, which
means that at the very least we gained permissions while we didn't
expect it, so let's block it.
|
|
These are architecture-specific anyway, so they belong in the Arch
directory. This commit also adds ThreadRegisters::set_initial_state to
factor out the logic in Thread.cpp.
|
|
No functional change.
|
|
|
|
We add this basic functionality to the Kernel so Userspace can request a
particular virtual memory mapping to be immutable. This will be useful
later on in the DynamicLoader code.
The annotation of a particular Kernel Region as immutable implies that
the following restrictions apply, so these features are prohibited:
- Changing the region's protection bits
- Unmapping the region
- Annotating the region with other virtual memory flags
- Applying further memory advises on the region
- Changing the region name
- Re-mapping the region
|
|
This syscall will be used later on to ensure we can declare virtual
memory mappings as immutable (which means that the underlying Region is
basically immutable for both future annotations or changing the
protection bits of it).
|
|
This patch validates that the size of the auxiliary vector does not
exceed `Process::max_auxiliary_size`. The auxiliary vector is a range
of memory in userspace stack where the kernel can pass information to
the process that will be created via `Process:do_exec`.
The reason the kernel needs to validate its size is that the about to
be created process needs to have remaining space on the stack.
Previously only `argv` and `envp` were taken into account for the
size validation, with this patch, the size of `auxv` is also
checked. All three elements contain values that a user (or an
attacker) can specify.
This patch adds the constant `Process::max_auxiliary_size` which is
defined to be one eight of the user-space stack size. This is the
approach taken by `Process:max_arguments_size` and
`Process::max_environment_size` which are used to check the sizes
of `argv` and `envp`.
|
|
Some programs explicitly ask for a different initial stack size than
what the OS provides. This is implemented in ELF by having a
PT_GNU_STACK header which has its p_memsz set to the amount that the
program requires. This commit implements this policy by reading the
p_memsz of the header and setting the main thread stack size to that.
ELF::Image::validate_program_headers ensures that the size attribute is
a reasonable value.
|
|
While this isn't really POSIX, it's needed by the Zig port and was
simple enough to implement.
|
|
This copies and adapts the setresgid syscall, following in the footsteps
of setreuid and setresuid.
|
|
Co-Authored-By: Daniel Bertalan <dani@danielbertalan.dev>
|
|
Now with the ability to specify different bases for the old and new
paths.
|
|
|
|
Co-Authored-By: Daniel Bertalan <dani@danielbertalan.dev>
|
|
Co-Authored-By: Daniel Bertalan <dani@danielbertalan.dev>
|
|
|
|
Previously we tried to determine if `fd` refers to a non-regular file by
doing a stat() operation on the file.
This didn't work out very well since many File subclasses don't
actually implement stat() but instead fall back to failing with EBADF.
This patch fixes the issue by checking for regular files with
File::is_regular_file() instead.
|
|
This syscall doesn't need to do anything for ENOSPC, as that is already
handled by its callees.
|
|
To accomplish this, we add another VeilState which is called
LockedInherited. The idea is to apply exec unveil data, similar to
execpromises of the pledge syscall, on the current exec'ed program
during the execve sequence. When applying the forced unveil data, the
veil state is set to be locked but the special state of LockedInherited
ensures that if the new program tries to unveil paths, the request will
silently be ignored, so the program will continue running without
receiving an error, but is still can only use the paths that were
unveiled before the exec syscall. This in turn, allows us to use the
unveil syscall with a special utility to sandbox other userland programs
in terms of what is visible to them on the filesystem, and is usable on
both programs that use or don't use the unveil syscall in their code.
|
|
We were only updating the tv_sec field and leaving UTIME_NOW in tv_nsec.
|
|
We now disallow jail creation from a process within a jail because there
is simply no valid use case to allow it, and we will probably not enable
this behavior (which is considered a bug) again.
Although there was no "real" security issue with this bug, as a process
would still be denied to join that jail, there's an information reveal
about the amount of jails that are or were present in the system.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Our implementation for Jails resembles much of how FreeBSD jails are
working - it's essentially only a matter of using a RefPtr in the
Process class to a Jail object. Then, when we iterate over all processes
in various cases, we could ensure if either the current process is in
jail and therefore should be restricted what is visible in terms of
PID isolation, and also to be able to expose metadata about Jails in
/sys/kernel/jails node (which does not reveal anything to a process
which is in jail).
A lifetime model for the Jail object is currently plain simple - there's
simpy no way to manually delete a Jail object once it was created. Such
feature should be carefully designed to allow safe destruction of a Jail
without the possibility of releasing a process which is in Jail from the
actual jail. Each process which is attached into a Jail cannot leave it
until the end of a Process (i.e. when finalizing a Process). All jails
are kept being referenced in the JailManagement. When a last attached
process is finalized, the Jail is automatically destroyed.
|
|
This function is already serialized by the address space lock.
|