Hadoop 2.x distributed cluster environment construction

Introduction

Big data is one of the hottest technology directions of this era. Companies in every industry, including traditional manufacturing and other brick-and-mortar businesses, hope to use the data they already have, or data they can crawl, to dig out more commercial value. My driving-school coach once described a scenario to me: could big data and artificial intelligence be used to identify local people who want to learn to drive, take the driver's license test, or buy a car, and then push the driving school's enrollment information or advertisements to those prospective customers through channels such as SMS, email, telephone, forums, and apps? For any technical project, the point of R&D is to improve the business and reduce costs; for traditional businesses in particular, cost reduction is the core goal, since the earlier stages such as market research, business and product planning, and strategic deployment are more about expanding the market and analyzing competitors. But that is getting off topic. Back to big data: the first thing we need is a solid framework, and that is Hadoop, which this article introduces. To do big data development on Hadoop, the environment has to be set up first.

CentOS virtual machine

1. VMware installation

Installing VMware is straightforward: just keep clicking Next. Download link: www.vmware.com/products/wo... I use VMware 14, the Windows version.

2. CentOS installation and NAT network configuration

CentOS 6.5 is used.

  • CentOS image selection

    Select the downloaded system image and click Next.

  • Hard disk size configuration

    A disk size of 20 GB is enough here; of course, if your physical disk is huge, nothing stops you from allocating 1 PB, but do not go below 5 GB or you will run out of space later. Choose to store the virtual disk as a single file; this makes it easy to move the created virtual machine to another computer.

  • Custom hardware configuration

Leave the other hardware unchanged and modify the memory, because the machine being configured now is the master node, which needs more memory than the slaves. How much to give it depends on your physical machine's memory: mine is 16 GB, so I use a 4-2-2 split, 4 GB for the master node and 2 GB for each of the two slave nodes. A 2-1-1 split is also enough; in actual work, of course, bigger is better.

  • Wait for the installation to complete

  • NAT network configuration

Next, configure the virtual machine's network: click Edit and select "Virtual Network Editor".

Select VMnet8 --> click "Remove Network" --> then click "Add Network".

Select VMnet8 again; VMware will automatically assign a subnet for us.

After the subnet is allocated, a VMnet8 entry appears; the network segment assigned to me is 192.168.241.0. Then select NAT mode and click "NAT Settings" to view the subnet information:

subnet: 192.168.241.0, subnet mask: 255.255.255.0, gateway: 192.168.241.2

These values are needed in the configuration later, so note them down.

Click the small computer (network adapter) icon in the lower right corner, open its settings and switch the adapter to bridged mode first; after confirming, open the settings again and switch it back to NAT mode so that the virtual machine's network is reinitialized.

Enter the following commands in the terminal:

  1. cd /etc/sysconfig/network-scripts

  2. vim ifcfg-eth0

  3. Enter and save the IP address, subnet mask, gateway, and other settings (see the sample ifcfg-eth0 sketch after this list).

  4. Run /etc/init.d/network restart to restart the network service.

  5. ping www.baidu.com; if it responds as shown in the figure below, the machine can access the network.
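
For reference, a minimal ifcfg-eth0 sketch for the master node, assuming the 192.168.241.0/24 subnet recorded above and a static IP of 192.168.241.10 for the master (the device name and other fields may differ on your machine):

    DEVICE=eth0
    TYPE=Ethernet
    ONBOOT=yes                 # bring the interface up at boot
    BOOTPROTO=static           # use a static IP instead of DHCP
    IPADDR=192.168.241.10      # the master node's IP
    NETMASK=255.255.255.0      # subnet mask recorded from the NAT settings
    GATEWAY=192.168.241.2      # gateway recorded from the NAT settings
    DNS1=192.168.241.2         # assumption: use the NAT gateway as DNS; any reachable DNS server works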

3. Use of Command Line Artifact SecureCRT

Operating the VM directly in VMware is cumbersome; connecting to it with the command-line tool SecureCRT makes everything much more convenient. After installing SecureCRT, open Session Manager --> New Session --> enter the master machine's IP as the host and the user name you configured. When you connect you will be asked for the password; ticking "save password" saves you from typing it again on future connections.

4. Configure two slave machines

It is simple: first suspend the master virtual machine, then find the folder where the virtual machine is stored on the local disk, copy it twice, and rename the two copies as shown below:

Then open the two slave-node virtual machines and modify the IP in the ifcfg-eth0 file under /etc/sysconfig/network-scripts on each of them: slave1 gets 192.168.241.11 and slave2 gets 192.168.241.12 (a sketch of the changed line follows below). Because the slaves are copies of the master node, their network cards have the same physical (MAC) address. Use the small computer icon in the lower right corner of VMware again: remove the network adapter, add a new one, and switch it to bridged mode; wait for the small icon to light up again after confirming, then open the settings once more and switch the adapter back to NAT mode. This reinitializes the NAT-mode network card.
Finally, ping www.baidu.com to test whether the slave node machines can access the Internet.
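
A rough sketch of the only line that changes in each slave's ifcfg-eth0, assuming the same template as the master's file above:

    IPADDR=192.168.241.11      # on slave1
    IPADDR=192.168.241.12      # on slave2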

Similarly, create two sessions in SecureCRT for the slave nodes; the steps are the same as for the master in section 3.

JDK installation

I use JDK 1.8, which can be downloaded from the Oracle official website. To copy files from Windows into CentOS, a shared folder is needed: open the virtual machine settings via the computer icon, and under the shared folders option enable file sharing and add the folder to be shared.
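
Assuming VMware Tools is installed in the guest, the shared folder normally appears under /mnt/hgfs; a quick way to confirm before copying:

    ls /mnt/hgfs/              # the share added above should be listed here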

  • Use the cp command to copy the jdk 1.8 package to the /usr/local/src/java directory, creating the java directory first:

    mkdir -p /usr/local/src/java
    cp jdk* /usr/local/src/java

    After copying, enter /usr/local/src/java and unpack the package with the tar -zxvf jdk* command.

  • Edit and configure jdk environment variables

    vim ~/.bashrc

    export JAVA_HOME=/usr/local/src/java/jdk1.8.0_181
    export CLASSPATH=:$CLASSPATH:$JAVA_HOME/lib
    export PATH=:$PATH:$JAVA_HOME/bin

    Run source ~/.bashrc to reload the environment variables (this file holds the environment variables of the current user).

  • Configure the slave node jdk and its environment variables

    Copy the decompressed jdk file to the other two slave nodes:

    1. scp -rp /usr/local/src/java 192.168.241.11:/usr/local/src/
       (use the scp remote-copy command to copy the java folder to the slave node with IP 192.168.241.11, i.e. into slave1's /usr/local/src/ directory)
    2. scp -rp /usr/local/src/java 192.168.241.12:/usr/local/src/
       copies it to the slave2 node machine in the same way
    3. Modify the ~/.bashrc environment configuration on each slave node in the same way as on the master, then update the environment variables with the source ~/.bashrc command. The figure below shows that the JDK environment is installed successfully; a quick check is sketched after this list.
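
    A quick check on any of the three nodes, assuming the environment variables have been reloaded:

    java -version              # should report java version "1.8.0_..."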

Hadoop cluster installation

1. Hadoop installation and configuration

  • Exactly as with the JDK installation, copy the hadoop 2.6.1 package to the /usr/local/src/hadoop/ directory and unpack it.
  • Modify the configuration files under hadoop's etc/hadoop directory
  1. Create new tmp, dfs/name, and dfs/data folders in the /usr/local/src/hadoop/hadoop-2.6.1 directory (a sketch of the commands follows).
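     A minimal sketch of those commands, assuming the unpack location above:

     mkdir -p /usr/local/src/hadoop/hadoop-2.6.1/tmp
     mkdir -p /usr/local/src/hadoop/hadoop-2.6.1/dfs/name
     mkdir -p /usr/local/src/hadoop/hadoop-2.6.1/dfs/data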

  2. Modify the configuration files:

    vim hadoop-env.sh

    export JAVA_HOME=/usr/local/src/java/jdk1.8.0_181

    vim yarn-env.sh

    export JAVA_HOME=/usr/local/src/java/jdk1.8.0_181

    vim slaves

    slave1
    slave2

    vim core-site.xml

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://192.168.241.10:9000</value>
        </property>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>file:/usr/local/src/hadoop/hadoop-2.6.1/tmp</value>
        </property>
    </configuration>
     

    vim hdfs-site.xml

    <configuration>
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>master:9001</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/usr/local/src/hadoop/hadoop-2.6.1/dfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/usr/local/src/hadoop/hadoop-2.6.1/dfs/data</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>3</value>
        </property>
    </configuration>
     

    vim mapred-site.xml

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>
     

    vim yarn-site.xml

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
            <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
            <property>
            <name>yarn.resourcemanager.address</name>
            <value>master:8032</value>
        </property>
            <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>master:8030</value>
        </property>
            <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>master:8035</value>
        </property>
            <property>
            <name>yarn.resourcemanager.admin.address</name>
            <value>master:8033</value>
        </property>
            <property>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>master:8088</value>
        </property>
    </configuration>
     
  3. Add hadoop to environment variables

    vim ~/.bashrc

    export HADOOP_HOME=/usr/local/src/hadoop/hadoop-2.6.1
    export PATH=$PATH:$HADOOP_HOME/bin

    source ~/.bashrc to update the environment variables
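
    A quick sanity check that hadoop's bin directory is now on the PATH (a sketch):

    hadoop version             # should report Hadoop 2.6.1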

2. Cluster machine node network configuration

Configure the host names and IPs of all three machines in each machine's /etc/hosts file.

192.168.241.10 master
192.168.241.11 slave1
192.168.241.12 slave2

At the same time, set the hostnames master, slave1, and slave2 in each machine's network file:

vim /etc/sysconfig/network
HOSTNAME=master            (on the master node)
HOSTNAME=slave1            (on slave1)
HOSTNAME=slave2            (on slave2)

Then run the hostname and bash commands so the new hostnames take effect, as shown below (a sketch of the commands follows):
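
A minimal sketch of those two commands, shown for the master node (run hostname slave1 / hostname slave2 analogously on the other two machines):

    hostname master            # set the hostname for the current session
    bash                       # open a new shell so the prompt reflects the new hostname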

3. Mutual secret-free login switch configuration between machine nodes

  • Turn off the firewall on all machine nodes, to avoid unnecessary errors when starting hadoop that are hard to troubleshoot

    /etc/init.d/iptables stop
    setenforce 0

  • Establish mutual trust and secret-free login between machine nodes

    ssh-keygen
    cd ~/.ssh (Enter the hidden ssh directory)

  • Copy the encrypted string in the id_rsa.pub public key file to the authorized_keys file.

    Copy the contents of the id_rsa.pub public key files of the other node machines into the authorized_keys file as well. My cluster has three node machines, so the authorized_keys file contains three public key strings, as shown below (a sketch of the full key exchange appears after this list):

  • Verify whether the free login via ssh mutual trust is successful

    ssh slave1
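
    Putting the key exchange together, a minimal sketch run on the master node, assuming the same user on all three machines and that ssh-keygen has already been run on every node:

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys              # add the master's own key
    ssh slave1 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # append slave1's key (asks for the password once)
    ssh slave2 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # append slave2's key
    chmod 600 ~/.ssh/authorized_keys                             # ssh requires restrictive permissions
    scp ~/.ssh/authorized_keys slave1:~/.ssh/                    # distribute the combined file
    scp ~/.ssh/authorized_keys slave2:~/.ssh/

    After this, ssh slave1 from the master (and ssh master from each slave) should no longer ask for a password.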

4. Run test in hadoop environment

  • Format the NameNode for the first time, then start the cluster

    hadoop namenode -format (only needed before the first start)

    ./start-all.sh (run from the $HADOOP_HOME/sbin directory)

  • Check whether all cluster processes are up

    Run the jps command on each node:

    master

    slave1

    slave2
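
    For reference, a rough sketch of the processes jps should list on each node with the configuration above (process IDs will differ, and the Jps process itself also appears):

    # on master
    NameNode
    SecondaryNameNode
    ResourceManager

    # on slave1 and slave2
    DataNode
    NodeManager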

    This means that the Hadoop distributed cluster environment has been successfully built!