使用spark对网站的浏览情况进行统计分析,生成数据会输出到HDFS上。这边使用的数据源文件是nginx日志。tmp.log
ngnix的access.log的格式,摘抄部分日志
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
   |  127.0.0.1 - - [05/Sep/2018:23:18:22 +0800] "GET /4DAnalog/clashreport/delete HTTP/1.1" 502 575 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:22 +0800] "GET /favicon.ico HTTP/1.1" 502 575 "http://localhost:8080/4DAnalog/clashreport/delete" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:40 +0800] "GET /4DAnalog/clashreport/find HTTP/1.1" 502 575 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:40 +0800] "GET /favicon.ico HTTP/1.1" 502 575 "http://localhost:8080/4DAnalog/clashreport/find" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:42 +0800] "GET /4DAnalog/clashreport/find HTTP/1.1" 502 575 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:42 +0800] "GET /favicon.ico HTTP/1.1" 502 575 "http://localhost:8080/4DAnalog/clashreport/find" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:43 +0800] "GET /4DAnalog/clashreport/find HTTP/1.1" 502 575 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:43 +0800] "GET /favicon.ico HTTP/1.1" 502 575 "http://localhost:8080/4DAnalog/clashreport/find" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:43 +0800] "GET /4DAnalog/clashreport/find HTTP/1.1" 502 575 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:44 +0800] "GET /favicon.ico HTTP/1.1" 502 575 "http://localhost:8080/4DAnalog/clashreport/find" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:52 +0800] "GET /4DAnalog/clashreport/delete HTTP/1.1" 502 575 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:53 +0800] "GET /favicon.ico HTTP/1.1" 502 575 "http://localhost:8080/4DAnalog/clashreport/delete" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:59 +0800] "GET /4DAnalog/chat/delete HTTP/1.1" 502 575 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36" 127.0.0.1 - - [05/Sep/2018:23:18:59 +0800] "GET /favicon.ico HTTP/1.1" 502 575 "http://localhost:8080/4DAnalog/chat/delete" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"
 
 
  | 
 
前期准备
需要提前准备好tmp.log上传到hdfs文件系统上
hdfs dfs -put ~/tmp.log /urlcount/
环境搭建及代码编写
1.创建maven项目
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
   | <?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0"          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">     <modelVersion>4.0.0</modelVersion>
      <groupId>com.zonegood</groupId>     <artifactId>hellospark</artifactId>     <version>1.0-SNAPSHOT</version>
      <properties>         <maven.compiler.source>1.8</maven.compiler.source>         <maven.compiler.target>1.8</maven.compiler.target>         <encoding>UTF-8</encoding>         <scala.version>2.10.6</scala.version>         <scala.compat.version>2.10</scala.compat.version>     </properties>
      <dependencies>         <dependency>             <groupId>org.scala-lang</groupId>             <artifactId>scala-library</artifactId>             <version>${scala.version}</version>         </dependency>
          <dependency>             <groupId>org.apache.spark</groupId>             <artifactId>spark-core_2.10</artifactId>             <version>1.5.2</version>         </dependency>
          <dependency>             <groupId>org.apache.spark</groupId>             <artifactId>spark-streaming_2.10</artifactId>             <version>1.5.2</version>         </dependency>
          <dependency>             <groupId>org.apache.hadoop</groupId>             <artifactId>hadoop-client</artifactId>             <version>2.6.2</version>         </dependency>     </dependencies>
      <build>         <sourceDirectory>src/main/scala</sourceDirectory>         <testSourceDirectory>src/test/scala</testSourceDirectory>         <plugins>             <plugin>                 <groupId>net.alchim31.maven</groupId>                 <artifactId>scala-maven-plugin</artifactId>                 <version>3.2.0</version>                 <executions>                     <execution>                         <goals>                             <goal>compile</goal>                             <goal>testCompile</goal>                         </goals>                         <configuration>                             <args>                                 <arg>-make:transitive</arg>                                 <arg>-dependencyfile</arg>                                 <arg>${project.build.directory}/.scala_dependencies</arg>                             </args>                         </configuration>                     </execution>                 </executions>             </plugin>             <plugin>                 <groupId>org.apache.maven.plugins</groupId>                 <artifactId>maven-surefire-plugin</artifactId>                 <version>2.18.1</version>                 <configuration>                     <useFile>false</useFile>                     <disableXmlReport>true</disableXmlReport>                     <includes>                         <include>**/*Test.*</include>                         <include>**/*Suite.*</include>                     </includes>                 </configuration>             </plugin>
              <plugin>                 <groupId>org.apache.maven.plugins</groupId>                 <artifactId>maven-shade-plugin</artifactId>                 <version>2.3</version>                 <executions>                     <execution>                         <phase>package</phase>                         <goals>                             <goal>shade</goal>                         </goals>                         <configuration>                             <filters>                                 <filter>                                     <artifact>*:*</artifact>                                     <excludes>                                         <exclude>META-INF/*.SF</exclude>                                         <exclude>META-INF/*.DSA</exclude>                                         <exclude>META-INF/*.RSA</exclude>                                     </excludes>                                 </filter>                             </filters>                             <transformers>                                 <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">                                     <mainClass>com.zomegood.hellospark.WordCount</mainClass>                                 </transformer>                             </transformers>                         </configuration>                     </execution>                 </executions>             </plugin>         </plugins>     </build> </project>
 
  | 
 
如果没有src/main/scala目录,需要手动创建

2.新建伴生对象com.zomegood.UrlCount.Main.scala
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
   | package com.zomegood.UrlCount
  import org.apache.spark.{SparkConf, SparkContext}
 
 
 
 
 
  object Main {
      def main(args : Array[String]) : Unit = {         val conf = new SparkConf().setAppName("UrlCount")         val sc = new SparkContext(conf)                           var rdd1 = sc.textFile(args(0)).map(_.split(" ")).map(arr => (arr(6),1));                  rdd1.reduceByKey(_+_).saveAsTextFile(args(1));         sc.stop()     }
  }
 
  | 
 
3.使用maven打jar包
运行
mvn clean package
以集群方式运行
bin/spark-submit --class com.zomegood.UrlCount.Main --master spark://cor1:7077 --executor-memory 512m --total-executor-cores 2 ../spark-mvn-1.0-SNAPSHOT.jar hdfs://cor1:9000/urlcount/tmp.log hdfs://cor1:9000/urlcount/out

使用saveAsTextFile运行结果存到hdfs上
