News Search

This project contains the code for searching through our dataset of news articles, blog posts, and Twitter and Facebook data using Hadoop.

Structure

  • src/ - the source code for the Hadoop search job (in the package edu.stanford.snap.spinn3rHadoop)
    • Search.java - the main entry point and run method for the Hadoop job
    • Spinn3rInputFilter.java - decides which input files to process based on the search date limits
    • Spinn3rMaper.java - the mapper for the Hadoop job: parses each record into a Java object, checks whether it satisfies the search conditions, and if it does, writes it to the output (see the sketch after this list)
    • utils/ParseCLI.java - parses the command-line arguments
    • utils/DocumentFilter.java - checks whether a document satisfies the search conditions
    • utils/Spinn3rDocument.java - converts records between strings and Java objects
  • target/ - the Hadoop search job packaged as a runnable JAR
  • web-page/ - everything related to the web page
    • api/ - the REST API for communicating with the Hadoop client
    • client/ - the client that communicates with the API on the web server and runs the jobs on the Hadoop cluster
    • lib/ - shared library code
    • www/ - the HTML and PHP code for the web page
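
A minimal sketch of the map step described above (this is not the actual Spinn3rMaper source: the class name, the hard-coded query, and the plain substring check are stand-ins for the real Spinn3rDocument parsing and DocumentFilter conditions):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SearchMapperSketch
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Hypothetical search term; the real job gets its conditions from the CLI.
    private static final String QUERY = "stanford";

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // In the real job, Spinn3rDocument parses the record and
        // DocumentFilter evaluates the full search conditions.
        if (value.toString().toLowerCase().contains(QUERY)) {
            context.write(value, NullWritable.get()); // matching record goes to output
        }
    }
}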

Usage

To run the search, use hadoop jar <path to the JAR> <additional arguments>. Several arguments are required; to see the help text, run the command without any additional arguments.
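
For example (keeping the placeholder for the JAR path):

# With no additional arguments the job prints its help text:
hadoop jar <path to the JAR>
# A real search passes the required arguments after the JAR path:
hadoop jar <path to the JAR> <additional arguments>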

Development information

To get the development environment up and running, follow these steps:

  1. make sure Maven is installed
  2. add the Cloudera repository by putting the following configuration into the ${HOME}/.m2/settings.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<settings>
    <profiles>
        <profile>
            <id>standard-extra-repos</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <repositories>
                <repository>
                    <!-- Central Repository -->
                    <id>central</id>
                    <url>http://repo1.maven.org/maven2/</url>
                    <releases>
                        <enabled>true</enabled>
                    </releases>
                    <snapshots>
                        <enabled>true</enabled>
                    </snapshots>
                </repository>
                <repository>
                    <!-- Cloudera Repository -->
                    <id>cloudera</id>
                    <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
                    <releases>
                        <enabled>true</enabled>
                    </releases>
                    <snapshots>
                        <enabled>true</enabled>
                    </snapshots>
                </repository>
            </repositories>
        </profile>
    </profiles>
</settings>
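
One way to check that Maven picked up the profile is to print the merged settings and look for the cloudera repository entry:

mvn help:effective-settings
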
  3. clone the project, navigate into its folder, and run mvn eclipse:eclipse to turn it into an Eclipse project and fetch the dependency JARs
  4. open Eclipse and import the project using File > Import > Existing Projects into Workspace, then Select root directory and click Finish
  5. to run the code locally, navigate to Search.java and run it as a Java application

To compile the code into a JAR that can be run on a Hadoop cluster, use mvn clean package; the compiled JAR will be placed in the target directory.
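
For example (the exact JAR file name depends on the artifactId and version in the POM):

mvn clean package
# the runnable JAR is now under target/; pass it to Hadoop as in Usage above:
hadoop jar target/<artifact>-<version>.jar <additional arguments>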

Additional links

The project was built following these instructions.

Another great tutorial can be found here.

Information about the CDH5 repository is available here.