MicroStrategy ONE
Starting in MicroStrategy 2021 Update 4, Hadoop Gateway is no longer supported.
Environment Considerations
Security on Data Access (Authentication)
Access to your cluster services may be controlled by a compliant Kerberos implementation (MIT Kerberos, Active Directory). In a Kerberos environment, the MicroStrategy Hadoop Gateway can identify itself as a Kerberos principal and gain access to the required services: HDFS and Spark Manager.
Hadoop Gateway as an Edge or Proxy Node of the Cluster
We recommend that the MicroStrategy Hadoop Gateway host be part of the Hadoop cluster for security, administration, and performance benefits. An Edge or Proxy node is physically or logically located within the cluster and contains the same set of libraries.
From an administration point of view, any upgrade of the cluster library version will include the Edge or Proxy node. Performance benefits because data transfer between the node and the rest of the cluster should be faster. Security improves because the node can be constrained to the same rules and authentication as the rest of the cluster.
High Availability Mode in HDFS and YARN Cluster Services
Following best practices, the cluster may have implemented High Availability (HA) mode for its services. One server node is set as Active, while an additional node is set to Standby and can replace the Active node at any time. An HA environment uses a different set of properties when referring to these services. Review your environment to determine whether it runs in HA mode.
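As a rough illustration of the "different set of properties", the sketch below reads Hadoop configuration XML and reports whether the usual HA properties are present. It assumes the standard Hadoop property names dfs.nameservices, dfs.ha.namenodes.<nameservice>, and yarn.resourcemanager.ha.enabled. The sample XML, the nameservice name nameservice1, and the function names are hypothetical and only stand in for your own hdfs-site.xml and yarn-site.xml.

```python
import xml.etree.ElementTree as ET


def parse_hadoop_config(xml_text):
    """Return the name/value pairs from the <property> elements of a *-site.xml document."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def split_csv(value):
    """Split a comma-separated property value into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def describe_ha(hdfs_props, yarn_props):
    """Summarize which HA-related properties are set in the parsed configuration."""
    nameservices = split_csv(hdfs_props.get("dfs.nameservices", ""))
    namenodes = {
        ns: split_csv(hdfs_props.get("dfs.ha.namenodes." + ns, ""))
        for ns in nameservices
    }
    return {
        # More than one NameNode ID behind a nameservice suggests HDFS HA.
        "hdfs_ha": any(len(ids) > 1 for ids in namenodes.values()),
        "nameservices": nameservices,
        "namenodes": namenodes,
        "yarn_ha": yarn_props.get("yarn.resourcemanager.ha.enabled", "false").lower() == "true",
    }


# Hypothetical sample content, standing in for hdfs-site.xml and yarn-site.xml.
SAMPLE_HDFS_SITE = """<configuration>
  <property><name>dfs.nameservices</name><value>nameservice1</value></property>
  <property><name>dfs.ha.namenodes.nameservice1</name><value>nn1,nn2</value></property>
</configuration>"""
SAMPLE_YARN_SITE = """<configuration>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
</configuration>"""

if __name__ == "__main__":
    print(describe_ha(parse_hadoop_config(SAMPLE_HDFS_SITE),
                      parse_hadoop_config(SAMPLE_YARN_SITE)))
```

In practice you would pass the contents of the configuration files deployed on the Edge or Proxy node. Where those files live depends on your distribution.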
System Requirements and Supported Configurations
The system requirements for the MicroStrategy Hadoop Gateway are the same as for a Spark cluster. The supported Spark version is 1.6.x. The supported distributions for MicroStrategy Hadoop Gateway are Cloudera Data Hub 5.10 or above and Hortonworks 2.4 or above.
For cluster environments with a standard authentication mechanism, the MicroStrategy Hadoop Gateway can be operated in Local, YARN client, or Spark Standalone mode. For environments with Kerberos authentication enabled, the MicroStrategy Hadoop Gateway can only be operated in YARN client mode.
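The rule above can be written down as a small check. This is only a sketch. The mode labels ("local", "yarn-client", "spark-standalone") and the function names are hypothetical and are not settings read by MicroStrategy Hadoop Gateway.

```python
# Hypothetical labels for the three operating modes named in the text.
ALL_MODES = ("local", "yarn-client", "spark-standalone")


def allowed_modes(kerberos_enabled):
    """Return the operating modes permitted for the given authentication setup."""
    if kerberos_enabled:
        # With Kerberos enabled, only YARN client mode can be used.
        return ("yarn-client",)
    return ALL_MODES


def validate_mode(mode, kerberos_enabled):
    """Return the mode unchanged, or raise ValueError if it is not permitted."""
    permitted = allowed_modes(kerberos_enabled)
    if mode not in permitted:
        raise ValueError(
            "mode %r is not allowed (kerberos_enabled=%s); choose one of %s"
            % (mode, kerberos_enabled, ", ".join(permitted))
        )
    return mode
```

For example, validate_mode("spark-standalone", kerberos_enabled=True) raises a ValueError, while the same call with kerberos_enabled=False returns the mode.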
The following are needed on a Hadoop cluster:
- You should have a Hadoop environment installed on Unix/Linux servers. The Hadoop cluster must have at least the HDFS service installed. Other services that can be installed include Hive, Hue, Oozie, and ZooKeeper.
- MicroStrategy Hadoop Gateway supports the High Availability mode of NameNode and YARN Resource Manager. To enable NameNode High Availability, see How to Browse the Hadoop Distributed File System and Preview Files. No extra configuration is required for YARN Resource Manager High Availability. It is handled automatically.
- If you are using the MicroStrategy Hadoop Gateway in YARN client mode, the Hadoop cluster should have the YARN and Spark services installed.
- If you are using the MicroStrategy Hadoop Gateway in Spark Standalone mode, the Hadoop cluster should have the Spark (Standalone) service installed.
  - You need the connectivity parameters to the Spark master (for example, spark://SparkMasterNode:7077).
  - Cloudera Manager does not allow this service to be installed if the cluster has Kerberos enabled.
- For troubleshooting purposes:
  - Access to the Spark Standalone website: http://SparkMasterNode:18080
  - Access to the Spark History Server website: http://SparkHistoryServerNode:18088
  - Access to the YARN resource monitor website: http://YARNResourceManagerNode:8088
- Make sure the Spark service is installed and configured properly. MicroStrategy Hadoop Gateway 10.11 and later is launched using spark-submit to avoid issues with cluster environment compliance and compatibility.
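To make the three operating modes concrete, the sketch below maps each one to the spark-submit master arguments that Spark itself accepts. It is not the command line that MicroStrategy Hadoop Gateway builds internally, which this topic does not describe. The mode labels and the function name are hypothetical, and SparkMasterNode is the placeholder host from the list above.

```python
def master_args(mode, spark_master_host="SparkMasterNode", spark_master_port=7077):
    """Return the spark-submit master arguments for a hypothetical mode label."""
    if mode == "local":
        # Run Spark inside a single JVM, using all available cores.
        return ["--master", "local[*]"]
    if mode == "yarn-client":
        # The driver runs on the submitting host; executors run in YARN containers.
        return ["--master", "yarn", "--deploy-mode", "client"]
    if mode == "spark-standalone":
        # Connect to a Spark Standalone master, for example spark://SparkMasterNode:7077.
        return ["--master", "spark://%s:%d" % (spark_master_host, spark_master_port)]
    raise ValueError("unknown mode: %r" % (mode,))


if __name__ == "__main__":
    for label in ("local", "yarn-client", "spark-standalone"):
        print(label, " ".join(master_args(label)))
```

The application JAR and its remaining arguments are deliberately left out, because they depend on your MicroStrategy Hadoop Gateway installation.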
Ports Used by MicroStrategy Hadoop Gateway
From | To: Service Default Port | Explanation |
---|---|---|
Intelligence Server | MicroStrategy Hadoop Gateway Host Port 30004 | Sends commands from the Intelligence Server to MicroStrategy Hadoop Gateway to fetch data. The port number is configurable in the MicroStrategy Hadoop Gateway configuration file: /conf/hgos-spark.properties. |
Intelligence Server | MicroStrategy Hadoop Gateway Host Port 4020 | Used by the Intelligence Server to browse HDFS through the MicroStrategy Hadoop Gateway RESTful service. The port number is configurable in the MicroStrategy Hadoop Gateway configuration file: /conf/hgos-spark.properties. |
MicroStrategy Hadoop Gateway | HDFS NameNode Port 8020 | The default port number is 8020. Contact your cluster administrators for the specific port number. |
HDFS (all nodes of the Hadoop cluster) | Intelligence Server Port 30241 | Used to send the query result set from the MicroStrategy Hadoop Gateway Spark application worker nodes to the Intelligence Server. The port number is configurable in the OS registry where the Intelligence Server is installed. Registry key: HKEY_LOCAL_MACHINE/SOFTWARE/Wow6432Node/MicroStrategy/DSS Server/Castor/DSPort. Registry file in Linux: MSIReg.reg. |
MicroStrategy Hadoop Gateway | YARN Resource Manager Port 8032 | YARN connectivity |
MicroStrategy Hadoop Gateway | Spark Port 4040 | Spark connectivity |
MicroStrategy Hadoop Gateway | Kerberos KDC Port 88 | Authenticates MicroStrategy Hadoop Gateway so that it can access other services (such as HDFS). |
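When firewall rules are in doubt, a simple TCP connection test run from the "From" host can confirm whether a port in the table is reachable. The following is a minimal sketch using only the Python standard library. The DEFAULT_PORTS structure and the function names are illustrative, and the port numbers are the defaults from the table, which may differ in your environment.

```python
import socket

# Default ports from the table above, as (from, to, port).
DEFAULT_PORTS = [
    ("Intelligence Server", "MicroStrategy Hadoop Gateway host", 30004),
    ("Intelligence Server", "MicroStrategy Hadoop Gateway host", 4020),
    ("MicroStrategy Hadoop Gateway", "HDFS NameNode", 8020),
    ("Hadoop cluster nodes", "Intelligence Server", 30241),
    ("MicroStrategy Hadoop Gateway", "YARN Resource Manager", 8032),
    ("MicroStrategy Hadoop Gateway", "Spark", 4040),
    ("MicroStrategy Hadoop Gateway", "Kerberos KDC", 88),
]


def ports_from(source):
    """Return the (target, port) pairs that the given source needs to reach."""
    return [(target, port) for src, target, port in DEFAULT_PORTS if src == source]


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, running is_port_open("NameNode", 8020) on the MicroStrategy Hadoop Gateway host checks the HDFS NameNode port, where NameNode is a placeholder for your own host name. A successful connection only shows that the port is reachable, not that the service behind it is configured correctly.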
If Kerberos Authentication Has Been Enabled
To learn about Kerberos installation, see How to Install the Kerberos Authentication Service.
For how to enable Kerberos authentication in a Cloudera CDH or Hortonworks HDP cluster, refer to the documentation for your distribution.
You will need a Kerberos principal (or SPN in Active Directory) to authenticate your MicroStrategy Hadoop Gateway process.
Kerberos authentication takes place in at least two situations:
- Browsing the HDFS file directory to select files to import. MicroStrategy Hadoop Gateway connects directly to the NameNode.
- Starting the MicroStrategy Hadoop Gateway in YARN client mode. MicroStrategy Hadoop Gateway deploys Spark applications across YARN and requires a Kerberos ticket to do so.
MicroStrategy Hadoop Gateway should be run under a valid Linux user account linked to a Kerberos principal. The principal can have any name, but by convention we refer to it as hgos/<HadoopGatewayHostFQDN>@REALM_NAME. Like any other cluster account, this account should be able to log in to all machines of the cluster.
This account should be allowed to log in to HDFS with write privileges in its home directory (for example, hdfs://NameNode:8020/user/hgos).
Cluster nodes should have the libraries required to work as a Kerberos client (for example, the packages krb5-workstation and openldap-client).
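The naming convention hgos/<HadoopGatewayHostFQDN>@REALM_NAME follows the general Kerberos principal layout of primary/instance@REALM. The sketch below builds and splits such a name. The function names, the host name gateway01.example.com, and the realm EXAMPLE.COM are hypothetical.

```python
def build_principal(host_fqdn, realm, primary="hgos"):
    """Build a principal name of the form primary/host_fqdn@REALM."""
    return "%s/%s@%s" % (primary, host_fqdn, realm)


def split_principal(principal):
    """Split primary/instance@REALM into its three parts (instance may be empty)."""
    name, sep, realm = principal.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError("principal must look like primary/instance@REALM: %r" % (principal,))
    primary, _, instance = name.partition("/")
    return primary, instance, realm


if __name__ == "__main__":
    # Hypothetical host and realm, used only for illustration.
    principal = build_principal("gateway01.example.com", "EXAMPLE.COM")
    print(principal)
    print(split_principal(principal))
```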
If High Availability Mode Has Been Enabled
Identify the nameservice of the HDFS service.
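The nameservice matters because, with NameNode High Availability, HDFS paths are addressed through the logical nameservice rather than through a single NameNode host and port. A minimal sketch of the difference follows. The function name and the nameservice value nameservice1 are hypothetical, and NameNode:8020 is the placeholder used elsewhere in this topic.

```python
def hdfs_uri(path, nameservice=None, namenode_host="NameNode", namenode_port=8020):
    """Build an HDFS URI for an absolute path, with or without an HA nameservice."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute: %r" % (path,))
    if nameservice:
        # HA: the logical nameservice replaces host:port, so the client can
        # reach whichever NameNode is currently Active.
        return "hdfs://%s%s" % (nameservice, path)
    # Non-HA: address the single NameNode directly.
    return "hdfs://%s:%d%s" % (namenode_host, namenode_port, path)


if __name__ == "__main__":
    print(hdfs_uri("/user/hgos"))
    print(hdfs_uri("/user/hgos", nameservice="nameservice1"))
```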
The following are needed on a MicroStrategy Hadoop Gateway driver machine:
- Host OS: Linux (recommended: CentOS-7).
- The host should be part of the CDH cluster as a proxy node or worker node.
- Java Runtime Environment version 1.7 or 1.8 (latest subversion available) must be installed.
- The Linux account must have write and execute privileges in the installation folder.
- The OS account should have an assigned user folder in HDFS with read/write privileges (for example, hdfs://<HDFSNameNode:8020>/user/<Principal name>/). A temporary directory, .sparkStaging, will be created there.
- Connectivity parameters (IP address and port) for connecting from the Intelligence Server.
- For detailed logs, replace the log4j.properties file with the richer version available in the troubleshooting section.
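The Java requirement in the list above can be checked with a short script. The sketch below parses the text printed by java -version, which the JVM writes to standard error, and accepts only 1.7 and 1.8. The function names are illustrative, and the parsing assumes the usual version "1.8.0_281" style of output.

```python
import re
import subprocess

SUPPORTED_JAVA = ((1, 7), (1, 8))


def parse_java_version(version_output):
    """Extract (major, minor) from `java -version` output, or None if not found."""
    match = re.search(r'version "(\d+)\.(\d+)', version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported_java(version_output):
    """Return True only for Java 1.7 or 1.8."""
    return parse_java_version(version_output) in SUPPORTED_JAVA


def read_java_version_output(java_cmd="java"):
    """Run `java -version` and return its output, or "" if Java cannot be run."""
    try:
        result = subprocess.run(
            [java_cmd, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        return ""
    # The version banner normally goes to stderr; stdout is kept as a fallback.
    return result.stderr + result.stdout


if __name__ == "__main__":
    output = read_java_version_output()
    print("Java supported:", is_supported_java(output))
```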
If Kerberos Authentication Has Been Enabled
- The host should have Kerberos client libraries installed (such as krb5-workstation) and allow Kerberos commands such as kinit and klist.
- The Java Runtime Environment should have the Java Cryptography Extension (JCE) libraries to support AES-256 encryption. These libraries are available at Oracle's website. The Java JCE package contains two JAR libraries. Use them to replace the existing files in the directory <JRE_HOME>/lib/security (if a JDK is installed instead of a JRE, the directory is <JDK_Home>/jre/lib/security). Keep a backup of your original libraries.
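Both Kerberos prerequisites can be given a quick preflight check. The sketch below looks for the kinit and klist commands on the PATH and for the two policy JARs in a given security directory. The file names local_policy.jar and US_export_policy.jar are an assumption based on the usual contents of Oracle's JCE policy package, since this topic does not name them, and the function names are illustrative.

```python
import os
import shutil

# Assumed names of the two JARs in the JCE policy package (not stated in this topic).
JCE_POLICY_JARS = ("local_policy.jar", "US_export_policy.jar")


def missing_kerberos_commands(commands=("kinit", "klist")):
    """Return the Kerberos client commands that cannot be found on the PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


def missing_jce_jars(security_dir):
    """Return the policy JARs that are absent from <JRE_HOME>/lib/security."""
    return [
        jar for jar in JCE_POLICY_JARS
        if not os.path.isfile(os.path.join(security_dir, jar))
    ]


if __name__ == "__main__":
    print("Missing Kerberos commands:", missing_kerberos_commands())
    # JAVA_HOME is used here only as a convenient guess at <JRE_HOME>.
    jre_home = os.environ.get("JAVA_HOME", "")
    print("Missing JCE JARs:", missing_jce_jars(os.path.join(jre_home, "lib", "security")))
```

Finding the files only shows that they exist. It does not prove that they are the unlimited-strength versions required for AES-256.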
On the Intelligence Server host, update firewall and network rules to allow inbound connections to port 30241 from the cluster worker nodes.
Related Topics
Introduction to the MicroStrategy Hadoop Gateway
How to Deploy the MicroStrategy Hadoop Gateway
How to Start the MicroStrategy Hadoop Gateway