======================================================
Advanced Configuration Options for YARN-Based Clusters
======================================================

The following sections explain a few advanced configuration options.

Configuring Memory, CPU, and YARN CGroups
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Memory and CPU related configuration properties must be adjusted to match your cluster configuration and requirements.

1. Memory

   ``yarn.memory`` in ``resources.json`` declares the amount of memory to request for YARN containers. It should be defined for each component, COORDINATOR and WORKER, based on the expected memory consumption, and is measured in MB. A YARN cluster is usually configured with a minimum container allocation, set in ``yarn-site.xml`` by the configuration parameter ``yarn.scheduler.minimum-allocation-mb``. It will also have a maximum size, set in ``yarn.scheduler.maximum-allocation-mb``. Asking for more than this maximum will result in the request being rejected.

   The heap size, defined by the ``-Xmx`` value in ``site.global.jvm_args`` in ``appConfig.json``, is used by the Presto JVM itself. The value of ``yarn.memory`` must be larger than the heap size allocated to any JVM; Slider suggests that using at least 50% more works in practice, though some experimentation will be needed. In addition, set the other memory-specific properties ``presto_query_max_memory`` and ``presto_query_max_memory_per_node`` in ``appConfig.json`` as you would set ``query.max-memory`` and ``query.max-memory-per-node`` in Presto's ``config.properties``. A sketch of these settings follows this list.

2. CPU

   Slider also supports configuring the YARN virtual cores to use for the process, which can be defined per component. ``yarn.vcores`` declares the number of "virtual cores" to request. Ask for more vcores if your process needs more CPU time (see the ``resources.json`` sketch after this list). See http://slider.incubator.apache.org/docs/configuration/resources.html#core for more details.

3. CGroups in YARN

   If you are using CPU scheduling (using the DominantResourceCalculator), you should also use CGroups to constrain and manage CPU processes. CGroups complements CPU scheduling by providing CPU resource isolation. With CGroups strict enforcement turned on, each CPU process gets only the resources it asks for. This way, containers hosting Presto services are guaranteed a percentage of the CPU. If another process that requires CPU resources needs to run on the same node, you can lower the percentage of CPU allocated to YARN to free up resources for that process. See the Hadoop documentation on how to configure CGroups in YARN: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html, and the ``yarn-site.xml`` sketch after this list. Once you have CGroups configured, Presto on YARN containers will be placed in the CGroups hierarchy like any other YARN application containers.
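As a concrete illustration of the memory and vcore settings in points 1 and 2 above, here is a minimal sketch of the relevant fragments. The property names are the ones discussed in this section; the numeric values (a 4 GB heap with roughly 50% container headroom, the vcore counts, and the query memory limits) are placeholders that must be tuned for your workload, and the exact nesting in your ``resources.json`` template may differ slightly:

.. code-block:: none

    "COORDINATOR": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "1",
      "yarn.memory": "6144",
      "yarn.vcores": "2"
    },
    "WORKER": {
      "yarn.role.priority": "2",
      "yarn.component.instances": "3",
      "yarn.memory": "6144",
      "yarn.vcores": "4"
    }

and, in the ``global`` section of ``appConfig.json``. The ``site.global.`` prefix and the list-style ``jvm_args`` string shown here mirror the ``site.global.jvm_args`` property mentioned above; match whatever key names and format your template already uses:

.. code-block:: none

    "site.global.jvm_args": "['-server', '-Xmx4096M']",
    "site.global.presto_query_max_memory": "6GB",
    "site.global.presto_query_max_memory_per_node": "2GB"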
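For CGroups (point 3), the following ``yarn-site.xml`` sketch shows the kind of NodeManager settings involved. The property names come from the Hadoop NodeManager CGroups documentation linked above; the values are illustrative, this is not a complete setup (the ``LinuxContainerExecutor`` also needs its executor group and, on secure clusters, ``container-executor.cfg`` configured), and enabling the ``DominantResourceCalculator`` for CPU scheduling is done separately in the scheduler configuration:

.. code-block:: none

    <!-- Run containers under the LinuxContainerExecutor so CGroups can be used -->
    <property>
      <name>yarn.nodemanager.container-executor.class</name>
      <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
    </property>
    <property>
      <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
      <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
    </property>
    <!-- CGroups hierarchy under which YARN containers are placed -->
    <property>
      <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
      <value>/hadoop-yarn</value>
    </property>
    <!-- Strict enforcement: each container gets only the CPU it asked for -->
    <property>
      <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
      <value>true</value>
    </property>
    <!-- Lower this to leave CPU headroom for non-YARN processes on the node -->
    <property>
      <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
      <value>80</value>
    </property>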
Slider can also define the YARN queue to which the application creation request is submitted; the queue can set the priority, resource limits, and other values of the application. This configuration is global to Slider and is defined in ``conf/slider-client.xml``. You can define the queue name and also the priority within the queue. All containers created in the Slider cluster will share this same queue.

::

    slider.yarn.queue             default
    slider.yarn.queue.priority    1

Failure Policy
~~~~~~~~~~~~~~

Follow this section if you want to change the default Slider failure policy. YARN containers hosting Presto may fail due to a misconfiguration in Presto or some other conflict. The number of times a component may fail within a failure window is defined in ``resources.json``. The related properties are:

1. The duration of a failure window, the time period in which failures are counted. The related properties are ``yarn.container.failure.window.days``, ``yarn.container.failure.window.hours``, and ``yarn.container.failure.window.minutes``, and they should be set in the global section, as they relate to Slider itself rather than to a particular component. The default value is ``yarn.container.failure.window.hours=6``. The initial window is measured from the start of the Slider application master; once the duration of that window is exceeded, all failure counts are reset and the window begins again.

2. The maximum number of failures of any component in this time period. ``yarn.container.failure.threshold`` is the property for this and, in most cases, should be set proportional to the number of instances of the component. For Presto clusters, where there is one coordinator and some number of workers, it is reasonable to set a higher failure threshold for workers than for the coordinator, because a higher failure rate of worker nodes is to be expected if the cause of the failure is the underlying hardware. At the same time, the threshold should be low enough to detect any Presto configuration issue that causes workers to fail rapidly and breach the threshold sooner.

These failure thresholds are all heuristics. When initially configuring an application instance, low thresholds reduce the disruption caused by components which are frequently failing due to configuration problems. In a production application, larger failure thresholds and/or shorter windows ensure that the application is resilient to transient failures of the underlying YARN cluster and hardware.

Based on the placement policy, there are two more failure-related properties you can set:

1. The configuration property ``yarn.node.failure.threshold`` defines how "unreliable" a node must be before it is skipped for placement requests. This is only used for the default ``yarn.component.placement.policy``, where unreliable nodes are avoided.

2. ``yarn.placement.escalate.seconds`` is the timeout after which Slider escalates the request so that pending containers are launched on other nodes. For the strict placement policy, where the requested components are deployed on all nodes, this property is irrelevant. For other placement policies it is relevant: the higher the cost of migrating a component instance from one host to another, the longer the recommended escalation timeout, so that Slider waits longer before the component instance is escalated to be started on other nodes. During a restart, when redeploying component instances on the same nodes as before is beneficial (due to locality of data or similar reasons), a higher escalation timeout is recommended.

See http://slider.incubator.apache.org/docs/configuration/resources.html#failurepolicy for more details on failure policy; a sample ``resources.json`` fragment illustrating these properties follows.
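For example, a ``resources.json`` along the following lines would shorten the failure window, allow workers to fail more often than the coordinator, and control placement escalation. The values are placeholders, and exactly where each property may be placed (globally or per component) is covered in the Slider resources documentation linked above; the fragment below is only a sketch:

.. code-block:: none

    "global": {
      "yarn.container.failure.window.hours": "1",
      "yarn.node.failure.threshold": "3"
    },
    "components": {
      "COORDINATOR": {
        "yarn.container.failure.threshold": "2"
      },
      "WORKER": {
        "yarn.container.failure.threshold": "10",
        "yarn.placement.escalate.seconds": "30"
      }
    }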
.. _using-yarn-label:

Using YARN label
~~~~~~~~~~~~~~~~

This is an optional feature and is not required to run Presto in YARN. To guarantee that a certain set of nodes is reserved for deploying Presto, or to dedicate a particular node to a component type, you can make use of YARN label expressions.

1. First, assign the appropriate labels to the nodes (or a subset of the nodes). See http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_yarn_resource_mgt/content/ch_node_labels.html. A minimal command sketch is given at the end of this section.

2. Then set ``yarn.label.expression`` on the components in ``resources.json`` to the labels to be used when allocating containers for Presto.

3. Create the application using ``bin/slider create .. --queue <queuename>``, where ``queuename`` is the queue defined in step 1 for the appropriate label.

If a label expression is specified for the slider-appmaster component, then it also becomes the default label expression for all components. A sample ``resources.json`` may look like:

.. code-block:: none

    "COORDINATOR": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "1",
      "yarn.component.placement.policy": "1",
      "yarn.label.expression": "coordinator"
    },
    "WORKER": {
      "yarn.role.priority": "2",
      "yarn.component.instances": "2",
      "yarn.component.placement.policy": "1",
      "yarn.label.expression": "worker"
    }

where ``coordinator`` and ``worker`` are the node labels created in YARN and configured with a scheduler queue.
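For step 1, the commands below are one possible way to create the labels and attach them to nodes on a cluster where node labels have already been enabled (typically ``yarn.node-labels.enabled=true`` plus a label store directory in ``yarn-site.xml``). The hostnames are placeholders, and mapping the labels to a Capacity Scheduler queue, the queue you then pass to ``--queue``, is done in ``capacity-scheduler.xml`` as described in the Hortonworks guide linked in step 1:

.. code-block:: none

    # define the cluster-level node labels
    yarn rmadmin -addToClusterNodeLabels "coordinator,worker"

    # attach the labels to specific nodes (hostnames are placeholders)
    yarn rmadmin -replaceLabelsOnNode "node1.example.com=coordinator node2.example.com=worker node3.example.com=worker"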