Python Code Text Analysis Worksheet
Text Processing

Modify the script so that it does at least two of the following for better comparisons:

Choose two literary works (e.g. Pride and Prejudice, Jane Eyre) and use them to recommend similar works (“If you like Pride and Prejudice, you will also like these works…”). Your program should go through the remaining works and list each one under the choice with the higher similarity index. For example, if Dracula is more similar to Jane Eyre than to Pride and Prejudice, Dracula should be listed under Jane Eyre.

Modify your recommendation program so that it reports the titles of the works rather than their file names. To do this, write a program that reads in the titles.txt file and creates a dictionary that looks up the title using the file name. This dictionary should then be used to report the works by their title instead of their file name.

Code:

    import os
    import math


    def count_word(table, word):
        'for the word entry in the table, increment its count or init to 1'
        if word in table:
            table[word] += 1
        else:
            # initialize count of word to 1
            table[word] = 1


    def analyze():
        '''read all texts from the docs folder, report similarity comparisons among all pairs'''
        doc_table = dict()  # file name -> {word: count} for that document
        word_set = set()    # every distinct word seen in any document
        os.chdir('docs')
        fileList = os.listdir()
        for fname in fileList:
            print("Opening " + fname)
            fd = open(fname, "r", encoding="utf8")
            doc_table[fname] = dict()
            data = fd.read()
            print("splitting")
            dataList = data.split()
            print("{} has {} words".format(fname, len(dataList)))
            for word in dataList:
                word_set.add(word)
                count_word(doc_table[fname], word)
            fd.close()
        os.chdir('..')  # return to parent directory
        # compare every document against every document (including itself)
        for fname in fileList:
            for fname2 in fileList:
                sim = similarity(doc_table[fname], doc_table[fname2], word_set)
                print("{:.2f} : {} vs. {}".format(sim, fname, fname2))


    def build_title_file():
        "creates titles.txt based on works in the docs folder"
        # utf8 keeps characters such as the apostrophe in "Alice’s" intact
        tfd = open("titles.txt", "w", encoding="utf8")
        os.chdir('docs')
        fileList = os.listdir()
        for fname in fileList:
            print("Opening " + fname)
            fd = open(fname, "r", encoding="utf8")
            for line in fd:
                if line.startswith("Title: "):
                    # write the file name, then the text after "Title: "
                    tfd.write(fname + "\n")
                    tfd.write(line[7:])
                    break
            fd.close()
        os.chdir("..")  # return to parent directory
        tfd.close()


    def similarity(tableA, tableB, words):
        'return cosine similarity between tableA and tableB over all words'
        ab = 0  # dot product of the two count vectors
        a2 = 0  # squared length of tableA's vector
        b2 = 0  # squared length of tableB's vector
        for w in words:
            ab += tableA.get(w, 0) * tableB.get(w, 0)
            a2 += tableA.get(w, 0) * tableA.get(w, 0)
            b2 += tableB.get(w, 0) * tableB.get(w, 0)
        return ab / (math.sqrt(a2) * math.sqrt(b2))

TXT file (titles.txt):

    alice_in_wonderland.txt
    Alice’s Adventures in Wonderland
    dracula.txt
    Dracula
    frankenstein.txt
    Frankenstein
    jane_eyre.txt
    Jane Eyre
    moby_dick.txt
    Moby Dick; or The Whale
    pride_and_prejudice.txt
    Pride and Prejudice
    tale_of_two_cities.txt
    A Tale of Two Cities
    udolpho.txt
    The Mysteries of Udolpho
    wizard_of_oz.txt
    The Wonderful Wizard of Oz
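One possible way to build the file-name-to-title dictionary is sketched below. It assumes titles.txt alternates a file-name line with a title line, as in the listing above. The function names parse_titles and load_titles are hypothetical and not part of the original script.

```python
def parse_titles(lines):
    """Build {file name: title} from alternating file-name / title lines."""
    # drop trailing newlines and skip any blank lines
    cleaned = [line.strip() for line in lines if line.strip()]
    # items 0, 2, 4, ... are file names; items 1, 3, 5, ... are titles
    return dict(zip(cleaned[0::2], cleaned[1::2]))


def load_titles(path="titles.txt"):
    """Read the titles file (assumed to be utf8) and return the lookup dictionary."""
    with open(path, "r", encoding="utf8") as fd:
        return parse_titles(fd)


# small in-memory sample so the sketch runs without titles.txt
sample = ["dracula.txt\n", "Dracula\n", "jane_eyre.txt\n", "Jane Eyre\n"]
titles = parse_titles(sample)

# fall back to the file name when a title is missing
print(titles.get("dracula.txt", "dracula.txt"))
print(titles.get("unknown.txt", "unknown.txt"))
```

Using titles.get(fname, fname) when reporting means a work without a "Title: " line is still listed, just under its file name.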

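The recommendation step could be approached along the following lines. This is only a sketch: the recommend function is hypothetical, the word counts in toy are made up for illustration (not measured from the texts), and sending ties to the first choice is an arbitrary assumption. The similarity function repeats the cosine similarity from the script so that the example runs on its own.

```python
import math


def similarity(tableA, tableB, words):
    'return cosine similarity between tableA and tableB over all words'
    ab = a2 = b2 = 0
    for w in words:
        ab += tableA.get(w, 0) * tableB.get(w, 0)
        a2 += tableA.get(w, 0) * tableA.get(w, 0)
        b2 += tableB.get(w, 0) * tableB.get(w, 0)
    return ab / (math.sqrt(a2) * math.sqrt(b2))


def recommend(doc_table, choice_a, choice_b):
    """List every other work under whichever choice it is more similar to."""
    # vocabulary across all documents
    words = set()
    for table in doc_table.values():
        words.update(table)
    groups = {choice_a: [], choice_b: []}
    for fname, table in doc_table.items():
        if fname in groups:
            continue  # skip the two chosen works themselves
        sim_a = similarity(table, doc_table[choice_a], words)
        sim_b = similarity(table, doc_table[choice_b], words)
        # assumption: a tie is listed under choice_a
        best = choice_a if sim_a >= sim_b else choice_b
        groups[best].append(fname)
    return groups


# made-up word counts, only to exercise the logic
toy = {
    "pride_and_prejudice.txt": {"ball": 3, "letter": 2},
    "jane_eyre.txt": {"night": 2, "fear": 1, "letter": 1},
    "dracula.txt": {"night": 4, "fear": 3},
    "wizard_of_oz.txt": {"ball": 1, "letter": 1},
}

groups = recommend(toy, "pride_and_prejudice.txt", "jane_eyre.txt")
for choice, similar in groups.items():
    print("If you like {}, you will also like these works:".format(choice))
    for fname in similar:
        print("    " + fname)
```

In the real script, doc_table would be the table of word counts built in analyze(), so analyze() would need to return it or call recommend directly. To report titles instead of file names, each name could be passed through the file-name-to-title dictionary before printing.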