I was surprised not to be able to google an answer to this so I want to record my findings here. To add (a.k.a. commision) a node to Hadoop cluster that should be used only for map-reduce tasks and not for storing data, you have multiple options:
- Do not start the datanode service on the node
- If you’ve configured Hadoop to allow only nodes on its whitelist files to connect to it then add it to the file pointed to by the property mapred.hosts but not to the file in dfs.hosts.
- Otherwise add the node to the DFS’ blacklist, i.e. file pointed to by the property dfs.hosts.exclude and execute
hadoop dfsadmin -refreshNodeson the namenode to apply it.
#3 is what I did as we weren’t using #2.
When the datanode and tasktracker services start on the new node, they will try to register with the namenode and jobtracker. If the node is on the DFS exclude list then its datanode will not be allowed to connect and consequently won’t be used to store data while map-reduce tasks will be allowed to run on it.
You can set (previously unset) property dfs.hosts.exclude in hdfs-site.xml without restarting the namenode service, it will pick it up anyway (likely when running -refreshNodes). Notice that it should contain path to a file on the local filesystem at the namenode server.