Using Jsoup to Parse HTML


Using Jsoup to Parse HTML csci 392

Background : Parse Trees

Statement:  A = B * C + 15;

    assignment
    ├── var
    │   └── A
    ├── =
    ├── expr
    │   └── +
    │       ├── *
    │       │   ├── B
    │       │   └── C
    │       └── 15
    └── ;
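The parse tree above can be sketched in plain Java. This is only an illustration of the idea, not part of Jsoup: the class names (ParseTreeDemo, Expr, Num, Var, BinOp) and the sample values for B and C are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class ParseTreeDemo {

    /** A node of the expression tree; every node can evaluate itself. */
    interface Expr {
        int eval(Map<String, Integer> vars);
    }

    /** Leaf: an integer literal such as 15. */
    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
        public int eval(Map<String, Integer> vars) { return value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    /** Leaf: a variable reference such as B or C. */
    static final class Var implements Expr {
        final String name;
        Var(String name) { this.name = name; }
        public int eval(Map<String, Integer> vars) { return vars.get(name); }
        @Override public String toString() { return name; }
    }

    /** Interior node: a binary operator with a left and a right subtree. */
    static final class BinOp implements Expr {
        final char op;
        final Expr left;
        final Expr right;
        BinOp(char op, Expr left, Expr right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
        public int eval(Map<String, Integer> vars) {
            int l = left.eval(vars);
            int r = right.eval(vars);
            switch (op) {
                case '+': return l + r;
                case '*': return l * r;
                default:  throw new IllegalArgumentException("unsupported operator: " + op);
            }
        }
        @Override public String toString() {
            return "(" + op + " " + left + " " + right + ")";
        }
    }

    /** Builds the tree for the right-hand side of A = B * C + 15. */
    static Expr buildExample() {
        return new BinOp('+',
                new BinOp('*', new Var("B"), new Var("C")),
                new Num(15));
    }

    /** Performs the assignment: evaluates expr and stores the result under target. */
    static int assign(String target, Expr expr, Map<String, Integer> vars) {
        int result = expr.eval(vars);
        vars.put(target, result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> vars = new HashMap<>();
        vars.put("B", 2);   // sample values, chosen only for this demo
        vars.put("C", 3);
        Expr rhs = buildExample();
        System.out.println(rhs);                              // tree in prefix form
        System.out.println("A = " + assign("A", rhs, vars));  // 2 * 3 + 15
    }
}
```

Because * sits below + in the tree, the multiplication is evaluated first, which is how the tree encodes operator precedence.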

Background : HTML DOM The Document Object Model (DOM) is a cross-platform and language-independent application programming interface that treats an HTML, XHTML, or XML document as a tree structure wherein each node is an object representing a part of the document. The objects can be manipulated programmatically and any visible changes occurring as a result may then be reflected in the display of the document. -- wikipedia.com

Background : HTML tags

<h3 align=center>This is a Header</h3>

    element:              h3
    attribute and value:  align=center
    text:                 This is a Header
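To make the tree idea concrete, here is a minimal sketch of a DOM-like structure in plain Java, holding the h3 header from this slide inside an html and body element. It is a toy model, not Jsoup's implementation: the DomTreeDemo and Node classes and their methods are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DomTreeDemo {

    /** One element of the document: a tag name, attributes, text, and child nodes. */
    static final class Node {
        final String tag;
        final Map<String, String> attributes = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();
        String text = "";

        Node(String tag) { this.tag = tag; }

        Node withAttr(String name, String value) { attributes.put(name, value); return this; }
        Node withText(String newText) { this.text = newText; return this; }
        Node add(Node child) { children.add(child); return this; }

        /** Counts this node plus all of its descendants. */
        int size() {
            int count = 1;
            for (Node child : children) {
                count += child.size();
            }
            return count;
        }

        /** Depth-first search for the first node with the given tag, or null if none. */
        Node find(String wanted) {
            if (tag.equals(wanted)) {
                return this;
            }
            for (Node child : children) {
                Node hit = child.find(wanted);
                if (hit != null) {
                    return hit;
                }
            }
            return null;
        }
    }

    /** Builds html > body > h3, where h3 carries align=center and the header text. */
    static Node buildExample() {
        Node h3 = new Node("h3").withAttr("align", "center").withText("This is a Header");
        return new Node("html").add(new Node("body").add(h3));
    }

    /** Appends an indented outline of the tree to out, one node per line. */
    static void print(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.tag);
        if (!node.attributes.isEmpty()) {
            out.append(' ').append(node.attributes);
        }
        if (!node.text.isEmpty()) {
            out.append(" \"").append(node.text).append('"');
        }
        out.append('\n');
        for (Node child : node.children) {
            print(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        Node root = buildExample();
        StringBuilder out = new StringBuilder();
        print(root, 0, out);
        System.out.print(out);

        // Nodes are ordinary objects, so the document can be changed programmatically.
        root.find("h3").withText("A New Header");
        System.out.println(root.find("h3").text);
    }
}
```

Jsoup offers the same kind of access on real pages: it parses the HTML into a tree of nodes, and calls such as select, attr, and text read from that tree.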

Using Jsoup

1. Go to jsoup.org/download
2. Save the .jar file into the directory with your source code
3. To compile:  javac -cp jsoup-1.10.3.jar test1.java
4. To run:      java -cp .:jsoup-1.10.3.jar test1

(The : classpath separator is for Unix-like systems; on Windows use ; instead.)

Jsoup packages

    org.jsoup
    org.jsoup.examples
    org.jsoup.helper
    org.jsoup.nodes
    org.jsoup.parser
    org.jsoup.safety
    org.jsoup.select

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import java.io.*;

    public class links1
    {
        public static void main(String[] args) throws IOException
        {
            // get URL from the user
            System.out.print("URL to Process: ");
            InputStreamReader stdio = new InputStreamReader(System.in);
            BufferedReader keyboard = new BufferedReader(stdio);
            String url = keyboard.readLine();

            // go get the HTML file and parse it into a Document
            Document doc = Jsoup.connect(url).get();

            // select every <a> element that has an href attribute
            Elements links = doc.select("a[href]");

            // print all links as href--text
            System.out.println("There are " + links.size() + " links:");
            for (Element link : links)
                System.out.println( link.attr("href") + "--" + link.text() );
        }
    }

Note: link.attr("href") returns the attribute exactly as written in the page, which may be a relative path. Use link.attr("abs:href") to get the absolute URL instead.
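The difference between a relative href and its absolute form can be illustrated with the standard library alone. The sketch below is not how Jsoup implements abs:href; it only shows the general idea of resolving a link against the URL of the page it came from. The LinkResolver class and its resolve helper are hypothetical names.

```java
import java.net.URI;

public class LinkResolver {

    /**
     * Resolves href against baseUrl using java.net.URI.
     * Hrefs that are already absolute (http://..., mailto:...) come back unchanged.
     * Throws IllegalArgumentException if either string is not a valid URI,
     * for example when it contains unescaped spaces.
     */
    static String resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://faculty.winthrop.edu/dannellys/";

        // relative link: appended to the directory of the base URL
        System.out.println(resolve(base, "csci392/default.htm"));

        // absolute link: returned as-is
        System.out.println(resolve(base, "http://www.winthrop.edu/cba"));

        // mailto link: returned as-is
        System.out.println(resolve(base, "mailto:dannellys@winthrop.edu"));
    }
}
```

The trailing slash on the base URL matters. Without it, the last path segment is treated as a file name and dropped, so resolving csci392/default.htm against http://faculty.winthrop.edu/dannellys gives http://faculty.winthrop.edu/csci392/default.htm.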

Output of links1

    > java -cp .:jsoup-1.10.3.jar links1
    URL to Process: http://faculty.winthrop.edu/dannellys/
    There are 19 links:
    http://www.winthrop.edu--
    http://www.winthrop.edu/cba/CSQM/--Department of Computer Science and Quantitative Methods
    http://www.winthrop.edu/cba--College of Business Administration
    mailto:dannellys@winthrop.edu--dannellys@winthrop.edu
    office/office_fall16.htm--Fall 2016 Office Hours and Class Schedule
    csci101/default.htm--CSCI 101 - Intro
    csci297/default.htm--CSCI 297 - Scripting Languages
    csci392/default.htm--CSCI 392 - Java
    csci208lab/default.htm--CSCI 208 Lab
    csci271/default.htm--CSCI 271 - Data Structures
    csci293/default.htm--CSCI 293 - C#
    csci325/default.htm--CSCI 325 - File Structures
    csci327/default.htm--CSCI 327 - Social Implications of Computing
    csci440/default.htm--CSCI 440 - Computer Graphics
    csci521/default.htm--CSCI 521 - Software Project Management
    csci621/default.htm--CSCI 621 - Software Project Mgmt
    csci626/default.htm--CSCI 626 - SQA
    mgmt661/default.htm--MGMT 661 - Management Info Systems
    WinthropDay2016.pptx--CSQM Winthrop Day Presentation

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import java.io.*;

    public class links2
    {
        public static void main(String[] args) throws IOException
        {
            Document doc = null;
            Elements links = null;
            Element link = null;

            // get URL from the user
            System.out.print("URL to Process: ");
            InputStreamReader stdio = new InputStreamReader(System.in);
            BufferedReader keyboard = new BufferedReader(stdio);
            String url = keyboard.readLine();

            // did it include http:// or https:// at the beginning?
            // if not, assume plain http
            if ( !url.startsWith("http://") && !url.startsWith("https://") )
                url = "http://" + url;

            // go get the HTML file; report failure instead of crashing
            try {
                doc = Jsoup.connect(url).get();
            }
            catch (Exception e) {
                System.out.println("Error: Unable to get that URL.");
                return;
            }

            // print the page title
            String title = doc.title();
            System.out.println("title is: " + title);

            // print all links as "text" --> href
            links = doc.select("a[href]");
            System.out.println("There are " + links.size() + " links:");
            for (int i=0; i<links.size(); i++)
            {
                link = links.get(i);
                System.out.println("\"" + link.text() + "\"\t --> " + link.attr("href"));
            }
        }
    }

Output of links2

    > java -cp .:jsoup-1.10.3.jar links2
    URL to Process: faculty.winthrop.edu/dannellys
    There are 19 links:
    "" --> http://www.winthrop.edu
    "Department of Computer Science and Quantitative Methods" --> http://www.winthrop.edu/cba/CSQM/
    "College of Business Administration" --> http://www.winthrop.edu/cba
    "dannellys@winthrop.edu" --> mailto:dannellys@winthrop.edu
    "Fall 2016 Office Hours and Class Schedule" --> office/office_fall16.htm
    "CSCI 101 - Intro" --> csci101/default.htm
    "CSCI 297 - Scripting Languages" --> csci297/default.htm
    "CSCI 392 - Java" --> csci392/default.htm
    "CSCI 208 Lab" --> csci208lab/default.htm
    "CSCI 271 - Data Structures" --> csci271/default.htm
    "CSCI 293 - C#" --> csci293/default.htm
    "CSCI 325 - File Structures" --> csci325/default.htm
    "CSCI 327 - Social Implications of Computing" --> csci327/default.htm
    "CSCI 440 - Computer Graphics" --> csci440/default.htm
    "CSCI 521 - Software Project Management" --> csci521/default.htm
    "CSCI 621 - Software Project Mgmt" --> csci621/default.htm
    "CSCI 626 - SQA" --> csci626/default.htm
    "MGMT 661 - Management Info Systems" --> mgmt661/default.htm
    "CSQM Winthrop Day Presentation" --> WinthropDay2016.pptx